I am not a data scientist. And while I’m familiar with a Jupyter notebook and have written a fair amount of Python code, I don’t claim to be an expert in machine learning. So when I ran the first part of our no-code/low-code machine learning experiment and got better than 90 percent accuracy on a model, I suspected I had done something wrong.
If you haven’t been following along, here’s a quick overview before I send you back to the first two articles in this series. To see how far machine learning tools for the rest of us had come, and to redeem myself for the unwinnable machine learning task I had been assigned last year, I took a well-worn heart attack dataset from an archive at the University of California, Irvine and tried to outperform data science students’ results using the “easy button” of Amazon Web Services’ low-code and no-code tools.
The whole point of this experiment was to see:
- If a relative novice could use these tools effectively and accurately
- If the tools were more cost-effective than finding someone who knew what they were doing and handing the job to them
That’s not exactly a true picture of how machine learning projects usually go. And as I found out, the “no code” option provided by Amazon Web Services, SageMaker Canvas, is intended to work hand-in-hand with the more data science-driven approach of SageMaker Studio. Still, Canvas outperformed what I was able to do with Studio’s low-code approach, though probably because of my less-than-skilled handling of the data.
Evaluating the robot’s work
Canvas allowed me to export a shareable link that opened the model I created with my full build from the 590+ rows of patient data from the Cleveland Clinic and the Hungarian Institute of Cardiology. This link gave me a bit more insight into what went on inside Canvas’ very black box by way of Studio, a Jupyter-based platform for doing data science and machine learning experiments.
As its name slyly suggests, Jupyter is based on Python. It is a web interface to a container environment that allows you to build kernels based on different Python implementations, depending on the task.
Kernels can be populated with whatever modules the project needs for code-focused explorations, such as the Python Data Analysis Library (pandas) and scikit-learn (sklearn). I used a local version of Jupyter Lab to do most of my initial data analysis to save on AWS compute time.
The Studio environment created with the Canvas link included pre-built content that gave some insight into the model Canvas produced, some of which I briefly discussed in the last article:
Some of the details included the hyperparameters used by the best-fitting version of the model created by Canvas:
Hyperparameters are tweaks that AutoML made to the algorithm’s calculations to improve accuracy. The details also covered some basic housekeeping: the SageMaker instance parameters, the tuning metric (“F1,” which we’ll discuss in a moment), and other inputs. This is all pretty standard for a binary classification like ours.
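For readers who want the gist of F1 before we get there, here is a minimal sketch in plain Python. It assumes the usual definition of F1 as the harmonic mean of precision and recall for the positive class. The function name and the labels in the example are made up for illustration; they are not Canvas output or rows from the heart attack dataset.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for a binary classifier: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: treat F1 as 0 rather than divide by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only (1 = heart attack, 0 = none).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))
```

Because F1 ignores true negatives, it tends to be a more telling tuning target than raw accuracy when one class is rarer than the other.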
The model overview in Studio provided basic information about the model produced by Canvas, including the algorithm used (XGBoost) and the relative importance of each of the columns, scored with something called SHAP values. SHAP is a really horrible acronym that stands for “SHapley Additive exPlanations,” a method based on game theory that extracts each data feature’s contribution to a change in the model output. It turns out that “maximum heart rate reached” had a negligible impact on the model, while thalassemia (“thall”) and angiography results (“caa”), data points for which we had significant missing data, were more impactful than I expected. I couldn’t just drop them, apparently. So I downloaded a performance report for the model to get more detailed information about how it performed.
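Before getting to that report, here is a rough sketch of the Shapley idea behind those scores. It is a brute-force calculation over every ordering of the features, which only works for a handful of columns; it is not how the SHAP tooling or Canvas computes its values. The toy model, its weights, and the patient values are all invented for illustration, with feature names borrowed loosely from the dataset’s columns.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start from the baseline input
        previous = predict(current)
        for name in order:
            # Switch one feature to the instance's value and record the change.
            current[name] = instance[name]
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: totals[name] / len(orderings) for name in names}


def toy_model(x):
    # Hypothetical risk score; the weights are made up, and
    # max_heart_rate is deliberately ignored by the model.
    return 0.1 + 0.2 * x["caa"] + 0.3 * x["thall"] + 0.1 * x["caa"] * x["thall"]


baseline = {"caa": 0, "thall": 0, "max_heart_rate": 0}
patient = {"caa": 1, "thall": 1, "max_heart_rate": 1}

values = shapley_values(toy_model, patient, baseline)
print(values)
```

The contributions add up to the difference between the model’s output for the patient and for the baseline, and a feature the model never uses, like the toy max_heart_rate here, ends up with a score of zero.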