End to End ML with GPT-3.5 | by Alex Adam | May 2023
Learn to use GPT-3.5 to do the heavy lifting for data acquisition, preprocessing, model training, and deployment
Plenty of repetitive boilerplate code exists in the model development phase of any machine learning application. Popular libraries such as PyTorch Lightning have been created to standardize the operations performed when training/evaluating neural networks, leading to much cleaner code. However, boilerplate extends far beyond training loops. Even the data acquisition phase of machine learning projects is full of steps that are necessary but time consuming. One way to deal with this challenge would be to create a library similar to PyTorch Lightning for the entire model development process. It would have to be general enough to work with a variety of model types beyond neural networks, and capable of integrating a variety of data sources.
Code examples for extracting data, preprocessing, model training, and deployment are readily available on the web, though gathering them and integrating them into a project takes time. Since such code is on the web, chances are it has been trained on by a large language model (LLM) and can be rearranged in a variety of useful ways through natural language instructions. The goal of this post is to show how easy it is to automate many of the steps common to ML projects by using the GPT-3.5 API from OpenAI. I'll show some failure cases along the way, and how to tune prompts to fix bugs when possible. Starting from scratch, without even so much as a dataset, we'll end up with a model that is ready to be deployed on AWS SageMaker. If you're following along, make sure to set up the OpenAI API as follows:
import openai
openai.api_key = "YOUR KEY HERE"
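(If you prefer not to hard-code the key, a common alternative is to read it from an environment variable; a small sketch, assuming you have exported OPENAI_API_KEY in your shell:)
import os
import openai

# read the API key from an environment variable instead of hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]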
Also, the following utility function is useful for calling the GPT-3.5 API:
def get_api_result(prompt):
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}]
    )

    result = request['choices'][0]['message']['content']
    print(result)
This section is simplified since it only considers a single data source, but can in principle be extended to situations where data comes from multiple sources (csv files, databases, images, etc.). The first step is to extract some data. For the sake of simplicity, I'll use the Income Prediction¹ dataset where the goal is to predict if an individual earns more/less than $50k per year based on their education, job position, industry, etc. The function below will be used to generate the code that downloads our data. Note how the prompt template is designed to bias the API to generate python code:
def extract(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)
Here we ask GPT-3.5 to:
Retrieve the adult income prediction dataset from openml using the sklearn fetch_openml function. Make sure to retrieve the data as a single dataframe which includes the target in a column named “target”. Name the resulting dataframe “df”.
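Concretely, this instruction is what gets passed to extract(); the call would look something like the following (a sketch, since the exact invocation isn't shown in the post):
# hypothetical invocation of extract() with the instruction above
extract(
    "Retrieve the adult income prediction dataset from openml using the sklearn "
    "fetch_openml function. Make sure to retrieve the data as a single dataframe "
    "which includes the target in a column named 'target'. Name the resulting dataframe 'df'."
)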
which gives us the output:
from sklearn.datasets import fetch_openml

adult_income = fetch_openml(name='adult', version=2)
df = pd.DataFrame(adult_income.data, columns=adult_income.feature_names)
df['target'] = adult_income.target
This code is free of bugs and gives us exactly what we want. Had I used a simpler prompt by removing mentions of openml and the function to use for retrieval, we would get:
import pandas as pd

df = pd.read_csv("adult_income_prediction.csv")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
"marital-status", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours-per-week",
"native-country", "target"]
This assumes that the data is locally available. What's interesting about this result is that it has the correct column names as a list, even though we didn't include them in the API call! These names are nevertheless all over the web, notably in this Medium post, except for the target column which is added by GPT. The next step is to transform the data into a format that's usable by machine learning models. We'll have to use a more advanced prompt template here since GPT needs to be made aware of the column names and their types:
def transform(columns, column_types, prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code using Pandas to preprocess the DataFrame. The DataFrame columns are {} and their corresponding dtypes are {}.\nInput: {}\nPython code:""".format(columns, column_types, prompt)

    get_api_result(prompt_template)
I then make the following request:
Preprocess the dataframe by converting all categorical columns to their one-hot encoded equivalents, and normalizing numerical columns. Drop rows which have an NA or NaN value in any column. Drop rows that have numeric column outliers as determined by their z score. A numeric column outlier is a value that is outside of the 1 to 99 inter-quantile range. The numerical columns should be normalized using StandardScaler from sklearn. The values in the target column should be converted to 0 or 1 and should be of type int.
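As with the extraction step, this request is passed to transform() along with the DataFrame's column names and dtypes; a sketch of what the call might look like (the exact invocation isn't shown in the post, and preprocessing_request is a hypothetical variable holding the text above):
# hypothetical call: supply the extracted DataFrame's column names and dtypes
# together with the preprocessing request quoted above
preprocessing_request = "Preprocess the dataframe by converting all categorical columns ..."  # full text as above
columns = list(df.columns)
column_types = [str(t) for t in df.dtypes]
transform(columns, column_types, preprocessing_request)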
We now get:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the DataFrame
df = pd.read_csv('data.csv')
# Convert categorical columns to one-hot encoded equivalents
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])
# Normalize numerical columns
scaler = StandardScaler()
numerical_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
# Drop rows with NA or NaN values
df.dropna(inplace=True)
# Drop rows with numeric column outliers
z_scores = df[numerical_cols].apply(lambda x: (x - x.mean()) / x.std())
df = df[(z_scores > -3).all(axis=1) & (z_scores < 3).all(axis=1)]
# Convert target column to 0 or 1 and change type to int
df['target'] = df['target'].apply(lambda x: 1 if x == '>50K' else 0).astype(int)
which is nearly perfect for our use if we exclude the part that tries to load in the data from a csv file. It's quite difficult to convince GPT to exclude this, even if we explicitly tell it to assume that df exists and should not be loaded. Lastly, we need to load the data into a local database. This is overkill for such a simple use case, but is a good habit to develop.
def load(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)
I use the following instruction:
Connect to an sqlite database named “data”. Use pandas to insert data from a DataFrame named “df” into a table named “income”. Do not include the index column. Commit the changes before closing the connection.
in order to get:
import sqlite3
import pandas as pd

# Create a connection to the database
connection = sqlite3.connect("data.db")

# Define the DataFrame
df = pd.DataFrame()

# Define the table name
table_name = "income"

# Insert the data into the table
df.to_sql(table_name, connection, if_exists="replace", index=False)

# Commit the changes to the database
connection.commit()

# Close the connection
connection.close()
which is perfect, minus the unnecessary creation of df. With this processed data in hand, we're ready to train some models.
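(Since the training prompt below assumes a DataFrame named “df” already exists, one way to get it back out of the database is a short read query; a minimal sketch, mirroring the connection conventions above:)
import sqlite3
import pandas as pd

# re-load the preprocessed data from the "income" table created above
connection = sqlite3.connect("data.db")
df = pd.read_sql("SELECT * FROM income", connection)
connection.close()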
Resources permitting, it's a good idea to try out multiple different model types to identify the one with the right level of complexity for the given task. Therefore, we ask GPT-3.5 to try out several different models. First, let's set up the generic prompt template for model training:
def train(prompt):
    prompt_template = """You are a ChatGPT language model that can generate Python code. Focus on using scikit-learn when applicable. Please provide a natural language input text, and I will generate the corresponding Python code.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)
and the prompt we'll be using is:
Train a variety of classification models to predict the “target” column using all other columns. Do so using 5-fold cross validation to choose the best model and corresponding set of hyperparameters, and return the best overall model and corresponding hyperparameter settings. Choose the best model based on accuracy. Assume a dataframe named “df” exists which is to be used for training. Log the entire process using MLFlow. Start logging with mlflow before training any models so only a single run is stored. Make sure that the model is logged using the sklearn module of mlflow. Make sure that only the best overall model is logged, but log metrics for all model types. The mean value of the following metrics on all cross validation folds should be logged: accuracy, AUC, F1 score
Let's take a look at the generated output and this time go deeper into why some of the specific instructions had to be provided.
import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn

# load the dataframe
df = pd.read_csv('path/to/dataframe.csv')

# define the features and target columns
features = df.columns.drop('target')
target = 'target'

# set up mlflow
mlflow.set_tracking_uri('localhost:5000') # or set up your own server
mlflow.set_experiment("my_experiment")

# set up cross-validation
cv = 5

# define the models and corresponding hyperparameters to be tuned
models = [
    {
        'name': 'Logistic Regression',
        'estimator': LogisticRegression(),
        'hyperparameters': {
            'penalty': ['l1', 'l2'],
            'C': [0.1, 1, 10]
        }
    },
    {
        'name': 'Decision Tree',
        'estimator': DecisionTreeClassifier(),
        'hyperparameters': {
            'max_depth': [2, 4, 6, 8]
        }
    },
    {
        'name': 'Random Forest',
        'estimator': RandomForestClassifier(),
        'hyperparameters': {
            'n_estimators': [10, 25, 50, 100],
            'max_depth': [None, 2, 4, 6, 8]
        }
    }
]

# iterate through the models and corresponding hyperparameters to train and tune them
with mlflow.start_run():
    best_model = None
    best_metrics = {}
    for model in models:
        print(f"Training {model['name']} model...")

        # set up grid search for hyperparameter tuning
        gs = GridSearchCV(model['estimator'], model['hyperparameters'], cv=cv, verbose=2, n_jobs=-1, scoring='accuracy')
        gs.fit(df[features], df[target])

        # log metrics for all model types
        metrics = cross_validate(gs.best_estimator_, df[features], df[target], cv=cv, n_jobs=-1,
                                 scoring=["accuracy", "roc_auc", "f1"])
        for metric, values in metrics.items():
            mlflow.log_metric(f"{model['name']}_{metric}", values.mean())

        # check if this is the best model based on accuracy
        if best_model is None or gs.best_score_ > best_metrics['accuracy']:
            best_model = gs.best_estimator_
            best_metrics = {'accuracy': gs.best_score_,
                            'AUC': roc_auc_score(df[target], best_model.predict_proba(df[features])[:, 1]),
                            'F1': f1_score(df[target], best_model.predict(df[features]))
                            }

    # register the best model in mlflow
    mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model")

    # log the best model's metrics
    for metric, value in best_metrics.items():
        mlflow.log_metric(f"best_model_{metric}", value)
If we remove the loading of df and the # set up mlflow section, we end up with exactly what's desired. Specifically, a loop over 3 different model types, performing a grid search using 5-fold cross validation to identify the best hyperparameters for the given model type, while keeping track of metrics. Without specifying “choose the best model based on accuracy”, the generated code will use scoring=["accuracy", "roc_auc", "f1"] for the grid search, which will not work since there is ambiguity as to how to choose the best model according to multiple metrics. Without “make sure that the model is logged using the sklearn module of mlflow”, we sometimes end up with mlflow.log_model() which is wrong. Also, “make sure that only the best overall model is logged” is necessary to avoid storing all models. Overall, this output is acceptable, but it's unstable, and running it multiple times is likely to introduce different bugs. In order to have everything ready for the serving step, it's useful to add the model signature when saving the model. This signature is basically a set of feature names and their corresponding types. It's a pain to get GPT-3.5 to add this, so some manual labor has to be done by first adding the import:
from mlflow.models.signature import infer_signature
and then modifying the line of code which logs the model via:
mlflow.sklearn.log_model(sk_model=best_model, artifact_path="best_model", signature=infer_signature(df[features], best_model.predict(df[features])))
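(To double-check what gets attached to the logged model, the inferred signature can also be printed on its own; a small optional sketch:)
# optional sanity check: inspect the inferred input/output schema
signature = infer_signature(df[features], best_model.predict(df[features]))
print(signature)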
Since we used MLflow to log the best model, we have a couple of options to serve the model. The simplest option is to host the model locally. Let's first design the general prompt template for model serving:
def serve_model(model_path, prompt):
    prompt_template = """You are a ChatGPT language model that can generate shell code for deploying models using MLFlow. Please provide a natural language input text, and I will generate the corresponding command to deploy the model. The model is located in the file {}.\nInput: {}\nShell command:""".format(model_path, prompt)

    get_api_result(prompt_template)
and the prompt will be:
Serve the model using port number 1111, and use the local environment manager
By calling serve_model("<model path here>", query)
we get:
mlflow models serve -m <model path here> -p 1111 --no-conda
Once we run this command in the shell, we're ready to make predictions by sending data encoded as JSON to the model. We'll first generate the command to send data to the model, and then create the JSON payload to be inserted into the command.
def send_request(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command.\nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)
The following request will be inserted into the prompt template in send_request():
Use the “curl” command to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.
The output generated by GPT-3.5 is:
curl -X POST -H "Content-Type: application/json" -d '<data here>' http://localhost:1111/invocations
It's preferable to have the URL immediately after curl instead of at the very end of the command, i.e.
curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d '<data here>'
Getting GPT-3.5 to do this is not easy. Both of the following requests fail to do so:
Use the “curl” command to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Place the URL immediately after “curl”. Make sure that the content type is “application/json”.
Use the “curl” command, with the URL placed before any argument, to send data “<data here>” to an mlflow model hosted at port 1111 on localhost. Make sure that the content type is “application/json”.
Maybe it's possible to get the desired output if we have GPT-3.5 modify an existing command rather than generate one from scratch. Here is the generic template for modifying commands:
def modify_request(prompt):
    prompt_template = """You are a ChatGPT language model that can modify commands for sending data using "curl". Please provide a natural language instruction, corresponding command, and I will generate the modified command.\nInput: {}\nCommand:""".format(prompt)

    get_api_result(prompt_template)
We will call this function as follows:
code = """curl -X POST -H "Content material-Kind: software/json" -d '<knowledge right here>' http://localhost:1111/invocations"""
immediate = """Please modify the next by putting the url earlier than the "-X POST" argument:n{}""".format(code)
modify_request(immediate)
which finally gives us:
curl http://localhost:1111/invocations -X POST -H "Content-Type: application/json" -d '<data here>'
Now it's time to create the payload:
def create_payload(prompt):
    prompt_template = """You are a ChatGPT language model that can generate code for sending data to deployed MLFlow models. Please provide a natural language input text, and I will generate the corresponding command.\nInput: {}\nPython code:""".format(prompt)

    get_api_result(prompt_template)
The prompt for this part needed quite a bit of tuning to get the desired output format:
Convert the DataFrame “df” to json format that can be received by a deployed MLFlow model. Wrap the resulting json in an object called “dataframe_split”. The resulting string should not have newlines, and it should not escape quotes. Also, “dataframe_split” should be surrounded by double quotes instead of single quotes. Do not include the “target” column. Use the split “orient” argument
Without the explicit instruction to avoid newlines and escaping quotes, a call to json.dumps() is made, which is not the format that the MLflow endpoint expects. The generated command is:
json_data = df.drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'
Before replacing <data here> in the curl request with the value of wrapped_data, we probably want to send only a few rows of data for prediction, otherwise the resulting payload is too large. So we modify the above to be:
json_data = df[:5].drop("target", axis=1).to_json(orient="split", double_precision=15)
wrapped_data = f'{{"dataframe_split":{json_data}}}'
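Rather than pasting wrapped_data into the curl command by hand, the same request can also be made from Python; a minimal sketch using the requests library, assuming the server started above is still listening on port 1111:
import requests

# POST the five-row JSON payload to the locally served MLflow model
response = requests.post(
    "http://localhost:1111/invocations",
    headers={"Content-Type": "application/json"},
    data=wrapped_data,
)
print(response.json())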
Invoking the model gives:
{"predictions": [0, 0, 0, 1, 0]}
while the actual targets are [0, 0, 1, 1, 0].
There we have it. At the beginning of this post, we didn't even have access to a dataset, yet we've managed to end up with a deployed model that was chosen to be the best through cross-validation. Importantly, GPT-3.5 did all the heavy lifting, and only required minimal assistance along the way. I did however have to specify particular libraries to use and methods to call, but this was mainly required to resolve ambiguities. Had I specified “Log the entire process” instead of “Log the entire process using MLFlow”, GPT-3.5 would have had too many libraries to choose from, and the resulting model format might not have been useful for serving with MLflow. Thus, some knowledge of the tools used to perform the various steps in the ML pipeline is required to have success using GPT-3.5, but it is minimal compared to the knowledge required to code from scratch.
Another option for serving the model is to host it as a SageMaker endpoint on AWS. Despite how easy this may look on the MLflow website, I assure you that, as with many examples on the web involving AWS, things will go wrong. First of all, Docker must be installed in order to generate the Docker image using the command:
mlflow sagemaker build-and-push-container
Second, the Python library boto3 used to communicate with AWS also requires installation. Beyond this, permissions must be properly set up so that the SageMaker, ECR, and S3 services can communicate with each other on behalf of your account. Here are the commands I ended up having to use:
mlflow deployments run-local -t sagemaker -m <model path> --name income_classifier
mlflow deployments create -t sagemaker --name income_classifier -m model/ --config image_url=<docker image url> --config bucket=mlflow-serving --config region_name=us-east-1
along with some manual tinkering behind the scenes to get the S3 bucket into the correct region.
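Once the endpoint is up, it can be queried with boto3; a hedged sketch, assuming the endpoint name matches the --name used above and the same JSON payload format built earlier:
import json
import boto3

# invoke the deployed SageMaker endpoint with the JSON payload built earlier
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="income_classifier",
    ContentType="application/json",
    Body=wrapped_data,
)
print(json.loads(response["Body"].read()))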
With the help of GPT-3.5 we went through the ML pipeline in a (mostly) painless way, though the last mile was a bit trickier. Note how I didn't use GPT-3.5 to generate the commands for serving the model on AWS. It works poorly for this use case, and creates made-up argument names. I can only speculate that switching to the GPT-4.0 API would help resolve some of the above bugs, and lead to an even easier model development experience.
While the ML pipeline can be fully automated using LLMs, it isn't yet safe to have a non-expert be responsible for the process. The bugs in the above code were easily identified because the Python interpreter would throw errors, but there are more subtle bugs that can be harmful. For example, the elimination of outlier values in the preprocessing code could be wrong, leading to too many or too few samples being discarded. In the worst case, it could inadvertently drop entire subgroups of people, exacerbating potential fairness issues.
Additionally, the grid search over hyperparameters could have been done over a poorly chosen range, leading to overfitting or underfitting depending on the range. This would be quite tricky to identify for someone with little ML experience since the code otherwise looks correct, but an understanding of how regularization works in these models is required. Thus, it isn't yet acceptable to have an unspecialized software engineer stand in for an ML engineer, but that time is fast approaching.
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)