In this post I discuss some of the assumptions that must be met for the use of parametric statistical tests. The post contains snippets in the R statistical programming language to help visualize the concepts and to show how these assumptions are tested. Click on the link above to view the post.
At a recent meeting of fellow surgeons in my department, an interesting difference of opinion arose. It relates to our trainees’ knowledge of statistics. Unfortunately, the meeting did not allow any time to properly discuss the topic.
Some background to illuminate your way. Registration as a medical specialist in South Africa is regulated by the Health Professions Council. In recent years, the Council has introduced the completion of a mandatory research project, culminating in a dissertation. This accompanies the usual prescribed formal examinations.
Universities in the country manage the research projects by way of a Master’s degree, for which all trainees must register.
The difference of opinion was simple. From the opposite corner of the ring, it was suggested that our trainees require no knowledge of statistical analysis and should hand in their data to a statistician and merely use the results in their reports.
I do not share this opinion and feel strongly that all medical professionals should have an understanding of the topic. While not all doctors and specialists are interested in research, I do believe that an understanding of statistics empowers the individual when evaluating published research. This in turn helps to inform and change their practice. As a surgeon, I know it does mine. With no formal program for statistical teaching in our department, I looked towards open education.
To this end, I was a leading proponent of getting the University of Cape Town to sign up with the Coursera and FutureLearn massive open online course platforms. The creation of twelve courses was funded by the Vice Chancellor and my course on Understanding Medical Research was the first to launch on Coursera. It has been a phenomenal experience and the feedback has been tremendous.
Unfortunately, austerity measures have curtailed these efforts. I funded my second course on Coursera through an external loan. It is on the use of Julia (mathematical biology using scientific computing) and was created in collaboration with the Applied Mathematics Department. The honors section of the course is on data management and statistical analysis.
To further my resolve in teaching medical statistics, I have taken to the Udemy platform with a course on medical statistics using Mathematica. In the last few days I have also launched a course on the use of SPSS in healthcare and life science statistics. Udemy is an interesting platform and I would encourage its use.
Link to the course: SPSS for healthcare and life science statistics
My opinion, though, is clear. Learning to analyze data is an empowering skill for everyone in healthcare.
So, you’ve spent a lot of time and effort in creating your Python machine learning model. The parameters have been tweaked and the metrics look great.
Now what? How do you share it with others to use? Well, one easy way is to pickle it. The pickle library in Python allows you to write your fitted model to a file that others can open. They can then simply enter their own data for prediction.
In this YouTube tutorial I create a random forest regressor model, export it as a pickle file, and then import it for use. Have a look at how easy it all is.
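The export and import steps can be sketched as follows. This is a minimal illustration and not the code from the video: the training data is simulated, and the file name rf_model.pkl, the number of trees, and the three predictors are all assumptions made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated training data (an assumption): 200 samples, 3 predictors,
# and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit the random forest regressor.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Export: write the fitted model to a pickle file (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Import: someone else opens the file and predicts on their own data.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

new_data = rng.normal(size=(5, 3))
predictions = loaded_model.predict(new_data)
print(predictions)
```

The loaded model gives the same predictions as the original. Only open pickle files from a source you trust, since unpickling can run arbitrary code, and it is safest to load the file with the same scikit-learn version that created it.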
The scikit-learn library for Python is a powerful machine learning tool.
K-means clustering, which is easily implemented in Python, uses geometric (Euclidean) distance to place centroids around which the data points group into clusters.
In the example attached to this article, I consider 99 hypothetical patients who are prompted to sync their smart watch healthcare app data with a research team. The data is recorded continuously, but to comply with healthcare regulations, the patients have to actively synchronize it. This example works equally well if we consider 99 hypothetical customers responding to a marketing campaign.
In order to prompt them, several reminder campaigns are run each year. In total there are 32 campaigns. Each campaign consists only of one of the following reminders: e-mail, short-message-service, online message, telephone call, pamphlet, or a letter. A record is kept of when they sync their data, as a marker of response to the campaign.
Our goal is to cluster the patients so that we can learn which campaign type they respond to. This can be used to tailor their reminders for the next year.
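A rough sketch of this clustering step is shown here. It is not the notebook from the video: the response data is simulated, each patient is given a hidden preferred reminder type, the response probabilities of 0.8 and 0.1 are invented for the example, and the choice of six clusters (one per reminder type) is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n_patients, n_campaigns = 99, 32
reminder_types = ["e-mail", "short-message-service", "online message",
                  "telephone call", "pamphlet", "letter"]

# Assumption: the 32 campaigns cycle through the six reminder types.
campaign_type = np.arange(n_campaigns) % len(reminder_types)

# Simulated behaviour (an assumption): each patient has one preferred
# reminder type and syncs with probability 0.8 for it, 0.1 otherwise.
preferred = rng.integers(0, len(reminder_types), size=n_patients)
prob = np.where(campaign_type[None, :] == preferred[:, None], 0.8, 0.1)

# Rows are patients, columns are campaigns, 1 means the patient synced.
responses = (rng.random((n_patients, n_campaigns)) < prob).astype(int)

# Cluster the patients on their response patterns.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(responses)

# For each cluster, find the reminder type with the highest response rate.
for c in range(kmeans.n_clusters):
    members = responses[labels == c]
    if len(members) == 0:
        continue
    rates = [members[:, campaign_type == t].mean()
             for t in range(len(reminder_types))]
    best = reminder_types[int(np.argmax(rates))]
    print(f"Cluster {c}: {len(members)} patients, respond best to {best}")
```

With real data, the number of clusters would have to be chosen by inspection or with a method such as the elbow plot, since the true grouping is not known in advance.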
In the attached video, I show you just how easy this is to accomplish in Python. I use the Python kernel in a Jupyter notebook. There is also a mention of dimensionality reduction using principal component analysis (PCA), also done using scikit-learn. This is done so that we can view the data as a scatter plot using the plotly library.
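The dimensionality reduction step might look something like the sketch below. The response matrix here is a random placeholder standing in for the 99 patients and 32 campaigns, so the numbers themselves carry no meaning; only the shape of the workflow is the point.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (an assumption): a random 99 x 32 binary matrix,
# one row per patient and one column per campaign.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(99, 32))

# Project the 32 campaign columns down to two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(responses)

print(coords.shape)  # (99, 2)
print(pca.explained_variance_ratio_)
```

The two columns of coords can then be used as the x and y values of a scatter plot, with each point colored by its cluster label. The explained variance ratio shows how much of the original variation the two components retain, which is worth checking before reading too much into the plot.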
You can view the video here.
I note more and more published papers on machine learning. As a clinician, I find it a fascinating way of looking at patient data. In case you are not familiar with machine learning, the definition given over at Wikipedia is: "Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. …machine learning explores the study and construction of algorithms that can learn from and make predictions on data."
That is exactly what machine learning is used for in medicine as well. In a particular branch of machine learning, called supervised learning, a dataset of predictor variables together with a known outcome variable can be passed to the machine, which in turn constructs a model from the data. A selection of the data is usually kept separately and is used to test the model. Given that the outcomes are known, it is trivial to calculate the accuracy of the model. Once a model is generated, data without a known outcome can be passed to the model, which will predict the outcome. This can indeed be very useful in medicine.
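The supervised learning workflow described above can be sketched in Python with scikit-learn. Everything in this example is an assumption made for illustration: the four predictors and the binary outcome are simulated, a random forest is used as the model, and 25% of the data is held back for testing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simulated dataset (an assumption): 300 patients, 4 predictors, and a
# binary outcome driven by the first two predictors plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Keep a selection of the data separate for testing the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Construct the model from the training data only.
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# The test outcomes are known, so accuracy is a simple comparison.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")

# New data without a known outcome: the model predicts it.
new_patients = rng.normal(size=(3, 4))
print(model.predict(new_patients))
```

Accuracy is only one way to judge a model. With unbalanced outcomes, which are common in clinical data, measures such as sensitivity and specificity are usually more informative.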
There are many tools available to do machine learning. I use both Python and Mathematica. It is really easy to do. I have put together a short video on YouTube for those familiar with Mathematica, just to show how easy it is.
In the video I use random forest, logistic regression, and support vector machine models to predict the presence of appendicitis from simulated modified Alvarado score predictor variables.