There and Back Again…a RAPIDS Tale

By Kris Manohar & Kevin Baboolal




Editor's Note: We're thrilled to announce that this post has been chosen as the winner of KDnuggets & NVIDIA's Blog Writing Contest.



Machine learning has revolutionized numerous domains by leveraging vast amounts of data. However, there are situations where acquiring sufficient data becomes a challenge because of cost or scarcity. In such cases, traditional approaches often struggle to provide accurate predictions. This blog post explores the limitations posed by small datasets and unveils an innovative solution proposed by TTLAB that harnesses the power of the nearest neighbor approach and a specialized kernel. We'll delve into the details of their algorithm, its benefits, and how GPU optimization accelerates its execution.



In machine learning, having a substantial amount of data is crucial for training accurate models. However, when faced with a small dataset comprising only a few hundred rows, the shortcomings become evident. One common issue is the zero frequency problem encountered in some classification algorithms such as the Naive Bayes Classifier. This occurs when the algorithm encounters an unseen category value during testing, leading to a zero probability estimate for that case. Similarly, regression tasks face challenges when the test set contains values that were absent in the training set. You may even find that your choice of algorithm improves (though it remains sub-optimal) when these missing features are excluded. These issues also manifest in larger datasets with highly imbalanced classes.
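To make the zero frequency problem concrete, here is a minimal sketch in plain Python. The toy data and the helper name `category_likelihood` are hypothetical, and the Laplace (add-one) smoothing shown is a standard textbook workaround, not part of TTLAB's method.

```python
from collections import Counter


def category_likelihood(values, category, n_categories, alpha=0.0):
    """Estimate P(value == category) from the observed values.

    alpha=0 gives the raw frequency; alpha=1 gives Laplace (add-one) smoothing.
    """
    counts = Counter(values)
    return (counts[category] + alpha) / (len(values) + alpha * n_categories)


# Hypothetical training column for one class: "green" never appears.
train_colors = ["red", "red", "blue", "red", "blue"]

raw = category_likelihood(train_colors, "green", n_categories=3)
smoothed = category_likelihood(train_colors, "green", n_categories=3, alpha=1.0)

print(raw)       # 0.0, which zeroes out the whole Naive Bayes product
print(smoothed)  # 0.125, small but non-zero
```

Because Naive Bayes multiplies the per-feature likelihoods, a single raw estimate of zero forces the class probability to zero no matter what the other features say.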



Although train-test splits often mitigate these issues, there remains a hidden problem when dealing with smaller datasets. Forcing an algorithm to generalize from fewer samples can lead to suboptimal predictions. Even if the algorithm runs, its predictions may lack robustness and accuracy. The simple solution of acquiring more data isn't always feasible because of cost or availability constraints. In such situations, an innovative approach proposed by TTLAB proves to be robust and accurate.



TTLAB's algorithm tackles the challenges posed by biased and limited datasets. Their approach involves taking the weighted average of all rows in the training dataset to predict the value of a target variable in a test sample. The key lies in adjusting the weights of each training row for every test row, based on a parameterized non-linear function of the distance between two points in the feature space. Although the weighting function has a single parameter (the rate at which a training sample's influence decays as its distance from the test sample increases), the computing effort to optimize over this parameter can be large. By considering the entire training dataset, the algorithm delivers robust predictions. This approach has shown remarkable success in enhancing the performance of popular models such as random forests and naive Bayes. As the algorithm gains popularity, efforts are underway to further improve its efficiency. The current implementation involves tuning the hyperparameter kappa, which requires a grid search. To expedite this process, a successive quadratic approximation is being explored, promising faster parameter optimization. Additionally, ongoing peer reviews aim to validate and refine the algorithm for broader adoption.
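The weighted-average idea can be sketched in a few lines of NumPy. This is an illustrative sketch rather than TTLAB's reference implementation: the function name `predict_one` is hypothetical, and it assumes a Euclidean distance with the weight 1 / (1 + d) ** kappa, which is the form that appears in the dataframe snippet later in this post.

```python
import numpy as np


def predict_one(X_train, y_train, x_test, kappa):
    """Weighted average of all training targets for a single test row."""
    # Euclidean distance from the test row to every training row (assumed metric).
    d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Influence decays with distance; kappa controls the rate of decay.
    w = 1.0 / (1.0 + d) ** kappa
    return float((w * y_train).sum() / w.sum())


X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
y_train = np.array([10.0, 20.0])
x_test = np.array([0.0, 0.0])

p_flat = predict_one(X_train, y_train, x_test, kappa=0.0)   # every row weighted equally
p_sharp = predict_one(X_train, y_train, x_test, kappa=2.0)  # nearby rows dominate
print(p_flat, p_sharp)
```

With kappa set to 0 every training row gets the same weight, so the prediction collapses to the plain mean of the targets. Larger values of kappa pull the prediction toward the targets of the nearest rows.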

To implement the TTLAB algorithm for classification, for loops and numpy proved inefficient, resulting in very long runtimes. The CPU implementation showcased in the linked publication focuses on classification problems, demonstrating the versatility and efficacy of the approach. The publication also shows that the algorithm benefits greatly from vectorization, hinting at further speed improvements that can be gained from GPU acceleration with CuPy. In fact, performing hyper-parameter tuning and random K-folds for result validation would have taken weeks for the multitude of datasets being tested. By leveraging the power of GPUs, the computations were distributed effectively, resulting in improved performance.
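As a rough illustration of what vectorization buys here, the sketch below predicts every test row at once with NumPy broadcasting and then runs a small grid search over kappa. The function names, the synthetic data, and the candidate kappa values are all hypothetical, and the distance and weight forms are the same assumptions as in the rest of this post. CuPy mirrors much of the NumPy API, so code in this style is the usual starting point for a GPU port, but only the NumPy version is shown.

```python
import numpy as np


def predict_all(X_train, y_train, X_test, kappa):
    """Predict every test row at once using broadcasting (no Python loops)."""
    diff = X_test[:, None, :] - X_train[None, :, :]  # (n_test, n_train, n_features)
    d = np.sqrt((diff ** 2).sum(axis=2))             # (n_test, n_train) distances
    w = 1.0 / (1.0 + d) ** kappa                     # per-pair weights
    return (w @ y_train) / w.sum(axis=1)             # weighted average per test row


def grid_search_kappa(X_train, y_train, X_val, y_val, kappas):
    """Return the candidate kappa with the lowest validation mean squared error."""
    errors = [
        np.mean((predict_all(X_train, y_train, X_val, k) - y_val) ** 2)
        for k in kappas
    ]
    return kappas[int(np.argmin(errors))]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

candidates = [0.0, 1.0, 2.0, 4.0, 8.0]
best = grid_search_kappa(X[:160], y[:160], X[160:], y[160:], candidates)
print(best)
```

The intermediate array has shape (n_test, n_train, n_features), so memory grows quickly with dataset size. Processing the test rows in batches is one way to keep that in check.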



Even with optimizations like vectorization and .apply refactoring, the execution time remains impractical for real-world applications. However, with GPU optimization, the runtime is drastically reduced, bringing execution times down from hours to minutes. This remarkable acceleration opens up possibilities for using the algorithm in scenarios where prompt results are essential.

Following the lessons learnt from the CPU implementation, we tried to further optimize our implementation. For this, we moved up a layer to cuDF dataframes. Vectorizing calculations onto the GPU is a breeze with cuDF. For us, it was as simple as changing import pandas to import cudf (provided you vectorize properly in pandas).

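# diff_cols are assumed to hold the squared per-feature differences to the current test row
train_df["sum_diffs"] = train_df[diff_cols].sum(axis=1).values
train_df["d"] = train_df["sum_diffs"] ** 0.5  # distance to the current test row
train_df["frac"] = 1 / (1 + train_df["d"]) ** kappa  # weight decays with distance
train_df["part"] = train_df[target_col] * train_df["frac"]
# prediction for this test row: weighted average over all training rows
test_df.loc[index, "pred"] = train_df["part"].sum() / train_df["frac"].sum()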
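The fragment above runs once per test row. For readers who want something runnable end to end, here is a self-contained pandas sketch of the surrounding loop. The wrapper `predict_frame`, the `diff_` column naming, and the toy data are hypothetical, and it assumes the difference columns hold squared per-feature differences. It is written against pandas only and has not been checked against cuDF, where row iteration in particular behaves differently.

```python
import pandas as pd


def predict_frame(train_df, test_df, feature_cols, target_col, kappa):
    """Fill test_df['pred'] with the distance-weighted average of the training targets."""
    train_df = train_df.copy()
    test_df = test_df.copy()
    test_df["pred"] = 0.0
    diff_cols = [f"diff_{c}" for c in feature_cols]  # hypothetical column names
    for index, row in test_df.iterrows():
        for c, dc in zip(feature_cols, diff_cols):
            # Squared difference between each training row and this test row.
            train_df[dc] = (train_df[c] - row[c]) ** 2
        train_df["d"] = train_df[diff_cols].sum(axis=1) ** 0.5
        train_df["frac"] = 1 / (1 + train_df["d"]) ** kappa
        train_df["part"] = train_df[target_col] * train_df["frac"]
        test_df.loc[index, "pred"] = train_df["part"].sum() / train_df["frac"].sum()
    return test_df


# Toy data, for illustration only.
train = pd.DataFrame({"x1": [0.0, 3.0], "x2": [0.0, 4.0], "G3": [10.0, 20.0]})
test = pd.DataFrame({"x1": [0.0], "x2": [0.0]})

result = predict_frame(train, test, ["x1", "x2"], "G3", kappa=2.0)
print(result)
```

Each column operation inside the loop is vectorized over the training rows, so the only remaining Python-level loop is the one over test rows.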


Further down our rabbit hole, we need to rely on Numba kernels. At this point, things get tricky. Recall why the algorithm's predictions are robust: each prediction uses all the rows in the training dataframe. However, Numba kernels don't support passing cuDF dataframes. Right now we're experimenting with some tricks suggested on GitHub to handle this case.

For now, we can at least pass off the raw compute to a Numba kernel via .apply_rows:

import numpy as np

def predict_kernel(F, T, numer, denom, kappa):
    for i, (x, t) in enumerate(zip(F, T)):
        d = abs(x - t)  # the distance measure
        w = 1 / pow(d, kappa)  # parameterized non-linear scaling
        numer[i] = w
        denom[i] = d

_tdf = train_df[[att, target_col]].apply_rows(
    predict_kernel,  # the kernel to run on each chunk of rows
    incols={att: "F", "G3": "T"},
    outcols={"numer": np.float64, "denom": np.float64},
    kwargs={"kappa": kappa},
)

p = _tdf["numer"].sum() / _tdf["denom"].sum()  # prediction - weighted average


At this point, we didn't eliminate all for loops, but simply pushing most of the number crunching to Numba reduced the cuDF runtime by more than 50%, landing us at around 2 to 4 seconds for the standard 80-20 train-test split.



It has been an exhilarating and enjoyable journey exploring the capabilities of the RAPIDS, CuPy, and cuDF libraries for various machine learning tasks. These libraries have proven to be user-friendly and easy to understand, making them accessible to most users. The design and maintenance of these libraries are commendable, allowing users to dive deep into the intricacies when necessary. In just a few hours a day over the course of a week, we were able to progress from being novices to pushing the boundaries of the library by implementing a highly customized prediction algorithm. Our next goal is to achieve unprecedented speed, aiming to break the microsecond barrier with large datasets ranging from 20K to 30K. Once this milestone is reached, we plan to release the algorithm as a pip package powered by RAPIDS, making it accessible for wider adoption and usage.

Kris Manohar is an executive director at ICPC, Trinidad and Tobago.
