There are numerous analyses on the internet and in research papers regarding COVID-19. Data from the pandemic is very useful for creating educational material. The Johns Hopkins University (JHU) data repository contains large open data sets on the pandemic.

In this notebook, I showcase the use of this data resource. The aims are as follows:

Use the JHU data as teaching material for the R language

Use the JHU data as teaching material for data analysis

Compare data between countries (South Africa, Germany, United Kingdom)

Look ahead at what may happen in South Africa in early 2021

**View the complete RPubs document here**

South Africa lags behind other countries in the COVID-19 timeline. Case numbers in South Africa rose much higher after the first wave, and it may be that the case load will be very high in the first part of 2021.

While a current strain of SARS-CoV-2 may well be more infective, there are likely confounding factors: human activity and interaction have increased markedly since the progressive lifting of restrictions, and the festive season may worsen upcoming case numbers.

Seroprevalence studies in South Africa are showing a much higher level of infection than confirmed cases report. Vaccines will take the better part of 2021 to reach large parts of South Africa.

If you are new to R, then perhaps a look at simple univariate data is a good place to start. In this RPubs post, I take a look at both categorical and numerical data. It is quite easy to calculate descriptive statistics of univariate data and to visualize it using plots. Click the link and have a look.

By the way, the file is also available on GitHub.

The World Bank provides open data for many indicators across most countries, spanning the last few decades.

This data is available online, searchable by country code (iso2c and iso3c), indicator name, and date. The indicators can be viewed here. The data can also be accessed via an application programming interface (API); the WDI library in R wraps this API, allowing for easy search and retrieval of data.
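To give a flavor of what such an API query looks like under the hood, here is a minimal sketch in Python (rather than R) that builds a World Bank v2 request URL for the maternal mortality ratio indicator (`SH.STA.MMRT`) across the three countries above. The `worldbank_url` helper is purely hypothetical, written for illustration; it is not part of the WDI library or any other package.

```python
# Hypothetical helper: form a World Bank v2 API query URL for one
# indicator across several countries (iso3c codes), as JSON.
def worldbank_url(countries, indicator, start, end):
    country_part = ";".join(countries)
    return (
        f"https://api.worldbank.org/v2/country/{country_part}"
        f"/indicator/{indicator}?date={start}:{end}&format=json"
    )

# SH.STA.MMRT is the World Bank's maternal mortality ratio indicator
url = worldbank_url(["USA", "BRA", "ZAF"], "SH.STA.MMRT", 2000, 2017)
print(url)
```

Requesting that URL returns the same figures that the WDI library retrieves for you in R.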

In this post, written as an R-markdown file, and available on RPubs and GitHub, I showcase the WDI library by looking at maternal mortality rates for the United States, Brazil, and South Africa.

Follow the links and have a look.

In this post, written as an R-markdown file and posted on RPubs, I discuss the assumptions for the use of parametric tests in R.

Parametric tests such as the various *t* tests, analysis of variance (ANOVA), and correlations are only valid if certain assumptions are met. When these assumptions are not met, the use of these tests in your research may lead to false claims.

In the post I show you the most important assumptions and how to test for them using the R programming language.
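The post itself works in R. Purely as a cross-language illustration of the same style of checks, here is a sketch in Python using scipy: the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance, on synthetic data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=40)   # roughly normal sample
group_b = rng.normal(loc=55, scale=5, size=40)

# Shapiro-Wilk: null hypothesis is that the sample is normally distributed
w_stat, p_normal = stats.shapiro(group_a)

# Levene: null hypothesis is that the groups have equal variances
l_stat, p_var = stats.levene(group_a, group_b)

# Only if both assumptions hold is the classic equal-variance t test appropriate;
# otherwise a Welch t test or a non-parametric alternative is safer.
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=(p_var > 0.05))
```

The R post covers the same ideas with `shapiro.test`, visual checks such as Q-Q plots, and more.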

The post is available on RPubs and the markdown file is on GitHub.

R is a programming language designed by statisticians for statistical analysis. It is a free programming language and is available for download (Windows, Mac, and Linux).

Bar a few eccentricities, it is quite easy to learn R. We make extensive use of it in the Klopper Research Group, where, alongside other programming languages, I use it to teach my students how to conduct proper data analysis.

I have started to create a series of R markdown files that are published on the Rpubs website . I am also making a series of YouTube videos on the use of R. The first set is on the use of the Plotly library to create interactive HTML widget plots in R.

Logistic regression is a statistical technique that uses independent variables (categorical or numerical) to predict a categorical dependent variable. It is based on the principles of linear regression. As the outcome (dependent) variable is categorical, though, logistic regression models the probability of each outcome category rather than a continuous value.

There are many methods of creating and testing the validity of a logistic regression model. The link above points to a web page explaining binomial logistic regression and how to use the R programming language to construct and understand your model.
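The linked page uses R; as a minimal cross-language sketch of the same idea, scikit-learn's `LogisticRegression` in Python turns a numeric predictor into a probability for a binary outcome. The data here is entirely synthetic, generated from a known logistic relationship.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: one numeric predictor (say, age) and a binary outcome
# whose true probability rises with the predictor.
age = rng.uniform(20, 80, size=200).reshape(-1, 1)
true_prob = 1 / (1 + np.exp(-(age.ravel() - 50) / 10))
outcome = rng.binomial(1, true_prob)

model = LogisticRegression().fit(age, outcome)

# The fitted model returns a probability, not a raw category label
p_70 = model.predict_proba([[70]])[0, 1]   # P(outcome == 1) at age 70
p_30 = model.predict_proba([[30]])[0, 1]   # P(outcome == 1) at age 30
```

Because the simulated relationship is increasing, the predicted probability at age 70 comes out higher than at age 30, which is exactly the kind of sanity check worth doing on any fitted model.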

In this post I discuss some of the assumptions that must be met for the use of parametric statistical tests. The post contains snippets in the R statistical programming language to help visualize the concepts and to show how these assumptions are tested. Click on the link above to view the post.

At a recent meeting of fellow surgeons in my department, an interesting difference of opinion arose. It relates to our trainees’ knowledge of statistics. Unfortunately, the meeting did not allow any time to properly discuss the topic.

Some background to illuminate your way. Registration as a medical specialist in South Africa is regulated by the Health Professions Council. In recent years, the Council has introduced the completion of a mandatory research project, culminating in a dissertation. This accompanies the usual prescribed formal examinations.

Universities in the country manage the research projects by way of a Master’s degree, for which all trainees must register.

The difference of opinion was simple. From the opposite corner of the ring, it was suggested that our trainees require no knowledge of statistical analysis and should hand in their data to a statistician and merely use the results in their reports.

I do not share this opinion and feel strongly that all medical professionals should have an understanding of the topic. While not all doctors and specialists are interested in research, I do believe that an understanding of statistics empowers the individual when evaluating published research. This in turn helps to inform and change their practice. As a surgeon, I know it does mine. With no formal program for statistical teaching in our department, I looked towards open education.

To this end, I was a leading proponent in getting the University of Cape Town to sign up with the Coursera and FutureLearn massive open online course platforms. The creation of twelve courses was funded by the Vice-Chancellor, and my course on Understanding Medical Research was the first to launch on Coursera. It has been a phenomenal experience and the feedback has been tremendous.

Unfortunately, austerity measures have curtailed these efforts. I funded my second course on Coursera through an external loan. It is on the use of Julia (mathematical biology using scientific computing) and was created in collaboration with the Applied Mathematics Department. The honors section of the course is on data management and statistical analysis.

To further my resolve in teaching medical statistics, I have taken to the Udemy platform with a course on medical statistics using Mathematica. In the last few days I have also launched a course on the use of SPSS in healthcare and life science statistics. Udemy is an interesting platform and I would encourage its use.

Link to the course: SPSS for healthcare and life science statistics

My opinion, though, is clear. Learning to analyze data is an empowering skill for everyone in healthcare.

So, you’ve spent a lot of time and effort creating your Python machine learning model. The parameters have been tweaked and the metrics look great.

Now what? How do you share it with others to use? Well, one easy way is to pickle it. The pickle library in Python allows you to write your model to a file that others can open. They can then simply enter their own data for prediction.

In this YouTube tutorial I create a random forest regressor model, export it as a pickle file, and then import it for use. Have a look at how easy it all is.
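A minimal sketch of that round trip, on synthetic data (the tutorial's own code may differ): train a random forest regressor, pickle it to a file, load it back, and predict.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Train a small random forest on synthetic data: y = 3x + noise
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, size=100)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Export the fitted model to a file...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...which someone else can later load and use on their own data
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

prediction = loaded.predict([[5.0]])[0]   # close to 3 * 5 = 15
```

One caveat worth knowing: a pickled model should be unpickled with compatible library versions, since scikit-learn objects are not guaranteed to load across very different releases.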

The scikit-learn library for Python is a powerful machine learning tool.

K-means clustering, which is easily implemented in Python, uses geometric distance to place centroids around which the data is grouped into clusters.

In the example attached to this article, I look at 99 hypothetical patients who are prompted to sync their smart watch healthcare app data with a research team. The data is recorded continuously, but to comply with healthcare regulations, the patients have to actively synchronize it. This example works equally well if we consider 99 hypothetical customers responding to a marketing campaign.

In order to prompt them, several reminder campaigns are run each year; in total there are 32 campaigns. Each campaign uses only one of the following reminder types: e-mail, short message service (SMS), online message, telephone call, pamphlet, or letter. A record is kept of when the patients sync their data, as a marker of response to the campaign.

Our goal is to cluster the patients so that we can learn which campaign type they respond to. This can be used to tailor their reminders for the next year.

In the attached video, I show you just how easy this is to accomplish in Python, using the Python kernel in a Jupyter notebook. There is also a mention of dimensionality reduction using principal component analysis, also done with scikit-learn, so that we can view the data as a scatter plot using the plotly library.
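The pipeline in the video can be sketched as follows. The data here is synthetic, standing in for the real response records: each of the 99 patients gets a count of responses per reminder channel, with three latent groups that each favor one channel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
channels = ["email", "sms", "online", "phone", "pamphlet", "letter"]

# Synthetic data: each row is one patient's response counts across the
# reminder channels, with a strong preference for one channel.
rows = []
for preferred in rng.integers(0, 3, size=99):
    counts = rng.poisson(1.0, size=len(channels)).astype(float)
    counts[preferred] += rng.poisson(6.0)   # strong response to one channel
    rows.append(counts)
X = np.array(rows)

# Cluster the patients by their response profiles
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
labels = kmeans.labels_

# Reduce to two principal components so the clusters can be shown
# as a 2-D scatter plot (e.g. with plotly)
coords = PCA(n_components=2).fit_transform(X)
```

Each patient's cluster label then suggests which reminder type to use for them in the next year's campaigns.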

You can view the video here.