How Did We Get to ML Model Reproducibility

When working on real-world ML projects, you come face-to-face with a series of obstacles. The ML model reproducibility problem is one of them.

This article takes you through the experience-based, step-by-step approach my ML team took to solve the ML reproducibility challenge while working on a fraud detection system for the insurance domain.

You’ll learn:

  • Why is reproducibility important in machine learning?
  • What were the challenges faced by the team?
  • What was the solution? (tool stack and a checklist)

Let’s start at the beginning!

Why is reproducibility important in machine learning?

To help you better understand this concept, I’ll share the journey my team and I went through.

Project background

Before discussing the details, let me tell you a little about the project. This ML-based project was a fraud detection system for the insurance domain, where a classification model was used to classify whether a person is likely to commit fraud or not, given the required details as input.

Initially, when we start working on any project, we don’t think about model deployment, reproducibility, model retraining, and so on. Instead, we tend to spend most of our time on data exploration, preprocessing, and modeling. This is a mistake when working on ML projects at scale. To back this up, here is the Nature survey conducted in 2016.

In this survey of about 1,500 scientists, more than 70% reported that they had been unable to reproduce the experiments of other scientists, and more than 50% had been unable to reproduce their own experiments. Keeping this and a few other details in mind, we created a project that was reproducible and deployed it successfully to production.

When working on this classification project, we realized that reproducibility is essential not only for consistent results but also for these reasons:

  • Stable ML Results and Practices: To make sure that our clients could trust our fraud detection model’s results, we had to make sure those results were stable. Reproducibility is the key factor when it comes to stabilizing the results of any ML pipeline. For reproducibility, we used an identical dataset and pipeline so that the same results could be produced by anyone on our team running the model. But to ensure that our data and pipeline components remained the same across runs, we had to track them using different MLOps tools.

For example, we used code versioning tools, model versioning tools, and data versioning tools that helped us keep track of everything in the ML pipeline. These tools also enabled close collaboration among our team members and ensured that best practices were followed during development.

  • Promotes Accuracy and Efficiency: The thing we emphasized the most was that we wanted our model to generate the same results again and again, no matter when we ran it. As any reproducible model gives the same results in every run, we just had to make sure that we didn’t change the model configuration or hyperparameters between runs. This helped us identify the best model out of all that we had tried.
  • Prevents Duplication of Efforts: One major challenge while developing this classification project was making sure that whenever one of our team members ran the project, they didn’t need to do all the configuration from scratch to achieve the same results. Also, if a new developer joined our project, they should be able to easily understand the pipeline and generate the same model. This is where version control tools and documentation helped us, as team members and new joiners had access to specific versions of the code, data, and ML models.
  • Enables Bug-Free ML Pipeline Development: There were times when running the same classification model didn’t produce the same results, which helped us find errors and bugs in our pipeline easily. Once they were identified, we were able to fix these issues quickly to make our pipeline stable.
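The point about getting the same results on every run can be illustrated with a small sketch. It assumes that one source of run-to-run variation is an unseeded random train/test split; the function names `split_indices` and `fingerprint` are made up for this example and are not taken from the project's code.

```python
import hashlib
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a seeded generator and split them."""
    rng = random.Random(seed)  # local generator, independent of global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def fingerprint(values):
    """Short hash of a sequence, handy for comparing two runs."""
    return hashlib.sha256(repr(list(values)).encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    train_a, test_a = split_indices(1000, 0.2, seed=42)
    train_b, test_b = split_indices(1000, 0.2, seed=42)
    # Two runs with the same seed should select exactly the same test rows.
    print("same split:", fingerprint(test_a) == fingerprint(test_b))
```

Comparing fingerprints instead of full index lists makes it cheap to record the split alongside each run and to spot the moment two runs stop matching.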

Every ML reproducibility challenge we faced

Now that you know about reproducibility and its different benefits, it’s time to discuss the major reproducibility issues that my team and I faced during the development of this ML project. Importantly, all these challenges are very common for any kind of ML or DL use case.

1. Lack of clear documentation

One major piece that we were missing initially was documentation. When we didn’t have any documentation, it hurt our team members’ performance, as they took more time than expected to understand the requirements and implement new features. It also became very difficult for the new developers on our team to understand the whole project. Due to this lack of documentation, a standard approach was missing, which led to a failure to reproduce the same results every time they ran the model.

You can consider documentation a bridge between the conceptual understanding of a project and its actual technical implementation. Documentation helps existing developers and new team members understand the nuances of the solution and the structure of the project.

2. Different computer environments

It’s common for different developers on a team to have different environments: operating systems (OSs), language versions, library versions, and so on. We had the same situation while working on the project. This affected our reproducibility, as each environment differed significantly from the others in terms of library versions, package implementations, and so on.

It’s a common practice to share code and artifacts among different team members for any ML project. So a slight change in the computer environment can create issues in running the existing project, and eventually developers will spend unnecessary time debugging the same code again and again.

3. Not tracking data, code, and workflow

Reproducibility in ML is only possible when you use the same data, code, and preprocessing steps. Not keeping track of these things can lead to different configurations being used to run the same model, which may result in different outputs in each run. So at some point in your project, you need to store all this information so that you can retrieve it whenever needed.

When working on the classification project, we didn’t keep track of all the models and their different hyperparameters at first, which turned out to be a barrier to achieving reproducibility.

4. Lack of standard evaluation metrics and protocols

Selecting the right evaluation metric is one of the likely challenges while working on any classification use case. You need to decide on the metrics that will work best for your use case. For example, in the fraud detection use case, our model couldn’t afford to predict a lot of false negatives, so we tried to improve the recall of the overall system. Not using a standard metric can reduce clarity among team members about the objective, and eventually it can affect reproducibility.

Finally, we had to make sure that all of our team members followed the same protocols and code standards so that the code was uniform, which made it more readable and understandable.


Machine learning reproducibility checklist: solutions we adopted

As ML engineers, we believe that every problem has one or more possible solutions, and ML reproducibility challenges are no exception. Even though there were a lot of reproducibility challenges in our project, we were able to solve all of them with the right strategy and the right selection of tools. Let’s now look at the machine learning reproducibility checklist we used.

1. Clear documentation of the solution

Our fraud detection project was a combination of multiple individual technical components and the integration among them. It was very hard for us to remember when and how each component would be used by which process. So for our project, we created a document containing information about each specific module that we worked on, for example, data collection, data preprocessing and exploration, modeling, deployment, monitoring, etc.

Documenting which solution strategies we had tried out or would be trying out, which tools and technologies we would be using throughout the project, which implementation decisions had been taken, etc. helped our ML developers better understand the ML project. With this documentation, they were able to follow the standard best practices and the step-by-step procedure to run the pipeline, and they knew which error needed what kind of resolution. This resulted in reproducing the same results every time our team members ran the model and helped us improve overall efficiency.

It also saved our team time, as we didn’t have to explain the entire workflow to new joiners and other developers; everything was laid out in the document.

2. Using the same computer environments

Developing the classification solution required our ML developers to collaborate and work on different sections of the ML pipeline. Since most of our developers were using different computing environments, it was hard for them to produce the same results due to various dependency changes. So, for reproducibility, we had to make sure that every developer was using the same computing environment, library versions, language versions, etc.

Figure: PIP and virtual environments

Using a Docker container or creating a shareable virtual environment are two of the best solutions for using the same computational environment. In our team, people were working on Windows and Unix environments with different language and library versions. Using Docker containers solved our problem and helped us get to reproducibility.
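Whichever option you choose, it helps to be able to check that two machines really do match. The sketch below is one possible way to do that with the Python standard library, not the setup our team used: it records the interpreter, OS, and selected package versions, then reports differences against a reference snapshot. The helper names `snapshot_environment` and `diff_environments` are hypothetical.

```python
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Collect interpreter, OS, and package versions for the given names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "packages": versions,
    }


def diff_environments(expected, actual):
    """Return {key: (expected, actual)} for every entry that differs."""
    diffs = {}
    for key in ("python", "os"):
        if expected[key] != actual[key]:
            diffs[key] = (expected[key], actual[key])
    for name, version in expected["packages"].items():
        other = actual["packages"].get(name)
        if version != other:
            diffs[name] = (version, other)
    return diffs
```

A reference snapshot could be saved as JSON next to the code and compared at the start of every pipeline run, so a mismatched library version shows up before it turns into a mismatched result.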

3. Tracking data, code, and workflow

Versioning data and workflow

Data was the backbone of our fraud detection use case, so even a slight change in the dataset could affect our model’s reproducibility. The data that we were using was not in the required shape and format to train the model. So we had to apply different data preprocessing steps like NaN value removal, feature generation, feature encoding, feature scaling, etc. to make this data compatible with the chosen model.

For this reason, we had to use data versioning tools like Pachyderm or DVC, which can help you systematically manage your data. You can watch this tutorial to see how it’s solved in Neptune: how to version and compare datasets.

Also, we didn’t want to repeat all the data processing steps every time we ran the ML pipeline, so using such data and workflow management tools helped us retrieve any specific version of the preprocessed data for an ML pipeline run.
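To make the idea of a data version concrete, here is a minimal sketch of content-based versioning. It is not the API of DVC, Pachyderm, or Neptune; it only illustrates the general principle of identifying a dataset by a hash of its bytes, and a preprocessed dataset by the raw data plus the ordered preprocessing steps. The function names and the step dictionaries are assumptions made for this example.

```python
import hashlib
import json


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preprocessed_version(raw_path, steps):
    """Identify a preprocessed dataset by its raw data and ordered step config."""
    payload = json.dumps(
        {"raw": file_digest(raw_path), "steps": steps}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

If the version string is used as a cache key, for example as part of the output file name, the pipeline can reuse preprocessed data when neither the raw file nor the steps have changed, and recompute it otherwise.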


Code versioning and management

During development, we had to make multiple code changes for ML module implementation, new feature implementation, integration, testing, etc. For reproducibility, we had to make sure that we used the same code version every time we ran a pipeline.

There are multiple tools for version controlling your entire codebase; some of the popular ones are GitHub and Bitbucket. We used GitHub for our use case. It also made team collaboration quite easy, as developers had access to each commit made by other developers. Code versioning tools made it easy for us to use the same code every time we ran an ML pipeline.
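One simple way to tie a pipeline run to a code version is to record the Git commit at the start of the run. The sketch below assumes the `git` command-line tool is installed and that the pipeline is launched from inside a repository; the function names are hypothetical, and both functions return `None` rather than failing when that assumption does not hold.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the HEAD commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or repo_dir is not a usable repository
    return result.stdout.strip()


def has_uncommitted_changes(repo_dir="."):
    """True if the working tree differs from HEAD; None if unknown."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return bool(result.stdout.strip())
```

Checking for uncommitted changes matters because a commit hash only describes the run if the working tree actually matched that commit when the pipeline started.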

Experiment tracking in ML

Finally, the most important part of making our ML pipeline reproducible was tracking all the models and experiments that we had tried out throughout the entire ML lifecycle. While working on the classification project, we tried different ML models and hyperparameter values, and it was very hard to keep track of them manually or with documentation. There are multiple tools available for tracking your code, data, and workflow. But instead of choosing a different tool for each of these tasks, we decided to pick one that could solve multiple problems, which seemed like the right solution.

The tool we picked is a cloud-based platform designed to help data scientists with experiment tracking, data versioning, model versioning, and a metadata store. It provides a centralized location for all these activities, making it easier for teams to collaborate on projects and ensuring that everyone is working with the most up-to-date information.

Tools like Comet, MLflow, etc. enable developers to access any specific version of the model so that they can decide which algorithm has worked best for them and with what hyperparameters. Again, which tool you go with depends on your use case and team dynamics.
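To show what an experiment tracker stores at its core, here is a deliberately tiny local stand-in that appends one JSON record per run. It is not the API of Neptune, Comet, or MLflow, and it offers none of their collaboration features; the function names, record fields, and example values in the usage note are all assumptions for illustration.

```python
import json
import time
from pathlib import Path


def log_run(log_path, params, metrics, data_version=None, code_version=None):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "data_version": data_version,
        "code_version": code_version,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every logged run back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def best_run(runs, metric):
    """Pick the run with the highest value of the given metric."""
    return max(runs, key=lambda run: run["metrics"][metric])
```

A call such as `log_run("runs.jsonl", {"model": "gbm", "n_estimators": 200}, {"recall": 0.83})` (made-up values) records enough to rerun that configuration later, and `best_run(load_runs("runs.jsonl"), "recall")` retrieves the strongest candidate.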

4. Deciding on standard evaluation metrics and protocols

As we were working on a classification project and also had an imbalanced dataset, we had to decide on the metrics that would work well for us. Accuracy is not a good measure for an imbalanced dataset, so we couldn’t use it. We had to decide among precision, recall, AUC-ROC, etc.

In a fraud detection use case, both precision and recall matter. False positives can cause inconvenience and annoyance to customers and potentially damage the reputation of the business. However, false negatives can be far more damaging and result in significant financial losses. So we decided to keep recall as our main metric for the use case.
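A short worked example shows why accuracy is misleading on imbalanced data. The numbers are invented for illustration (10 fraudulent claims out of 1,000) and are not results from our project; the helper `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = fraud)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    y_true = [1] * 10 + [0] * 990

    # A model that never flags fraud: high accuracy, zero recall.
    print(classification_metrics(y_true, [0] * 1000))

    # A model that catches 8 of 10 frauds at the cost of 30 false alarms.
    y_pred = [1] * 8 + [0] * 2 + [1] * 30 + [0] * 960
    print(classification_metrics(y_true, y_pred))
```

The first model scores 99% accuracy while missing every fraudulent claim. The second has slightly lower accuracy (96.8%) but a recall of 0.8, which is the behavior a recall-focused fraud system is after.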

Also, we decided to use the PEP 8 standard for coding, as we wanted our code to be uniform across all the components that we were developing. Choosing a single metric to focus on and PEP 8 for standard coding practices helped us write easily reproducible code.


After reading this article, you now know that reproducibility is an important factor when working on ML use cases. Without reproducibility, it would be hard for anyone to trust your findings and results. I’ve walked you through the importance of reproducibility based on personal experience, and shared some of the challenges that my team and I faced along with the solutions we adopted.

If you need to remember one thing from this article, it would be to use specialized tools and services to version control every possible thing: data, pipeline, model, and the different experiments. This allows you to pick any specific version and run the entire pipeline to get the same results every time.


