This Code Pattern guides you through using scikit-learn and Python in Watson Studio to predict opioid prescribers based on a 2014 Kaggle dataset.
Opioid prescriptions and overdoses are an increasingly overwhelming problem for the United States, even prompting a declared state of emergency in recent months. Though we, as data scientists, may not be able to single-handedly fix this problem, we can dive into the data to figure out what exactly is going on and what may happen in the future given current circumstances.
This Code Pattern aims to do just that: it dives into a Kaggle dataset that covers opioid overdose deaths by state as well as individual physicians, their credentials, their specialties, whether or not they prescribed opioids in 2014, and the specific names of the prescriptions they wrote. Follow along to see how to explore the data in a Watson Studio notebook and visualize a few initial findings in a variety of ways, including geographically, using Pixie Dust. Pixie Dust is a great library to use when you need to explore your data visually very quickly. It needs only one line of code!
Once that initial exploration is complete, this Code Pattern uses the machine learning library scikit-learn to train several models and figure out which have the most accurate predictions of opioid prescriptions. Scikit-learn, if you're unfamiliar, is commonly used by data scientists due to its ease of use. Specifically, the library gives you easy access to a number of machine learning classifiers that you can implement with relatively few lines of code. Scikit-learn also allows you to visualize your output, showcasing your findings. Because of this, the library is often used in machine learning classes to teach what different classifiers do, much like the comparative output this Code Pattern highlights! Ready to dive in?
This Code Pattern consists of two activities:
Log into IBM's Watson Studio. Once in, you'll land on the dashboard.
Create a new project by clicking
+ New project and choosing
Enter a name for the project and click
NOTE: Creating a project in Watson Studio also creates a free-tier Object Storage service and Watson Machine Learning service in your IBM Cloud account. Select the Free storage type to avoid fees.
Upon a successful project creation, you are taken to a dashboard view of your project. Take note of the
Settings tabs; we'll use them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
+ Add to project on the top right and choose the
Fill in the following information:
From URL tab.
Name for the notebook and optionally a description.
Notebook URL provide the following URL: https://github.com/IBM/predict-opioid-prescribers/blob/master/notebooks/opioid-prescription-prediction.ipynb
Python 3.5 option.
TIP: Once successfully imported, the notebook should appear in the
Notebooks section of the
From the new project Overview panel, click + Add to project on the top right and choose the Data asset type.
A panel on the right of the screen will appear to assist you in uploading data. Follow the numbered steps in the image below.
browse option. From your machine, browse to the location of the
perscriber-info.csv files in this repository, and upload them. [not numbered]
Now all assets should appear in your project overview.
(►) Run button to start stepping through the notebook.
Stop at the Insert Pandas Data Frame sections.
Click on the 1001 data icon in the top right. The data files should show up.
Click on each and select Insert Pandas Data Frame. Once you do that, generated code that loads the file will appear in the highlighted cell.
Make sure your
opioids.csv is saved as
overdoses.csv is saved as
prescriber_info.csv is saved as
df_data_3 so that it is consistent with the original notebook.
To get familiar with your data, explore it with visualizations and by looking at subsets of the data. For example, although California has the highest number of overdoses, correcting for population shows that West Virginia actually has the highest rate of overdoses per capita.
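The difference between raw counts and per-capita rates can be sketched with pandas. This is a minimal illustration, not the notebook's code: the column names (`State`, `Population`, `Deaths`) are assumptions, and the numbers are made-up placeholders rather than values from the Kaggle dataset.

```python
import pandas as pd

# Toy placeholder numbers, NOT values from the Kaggle dataset.
# Column names (State, Population, Deaths) are assumptions for illustration.
overdoses = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Population": [1_000_000, 200_000, 500_000],
    "Deaths": [500, 300, 100],
})

# Raw counts favor large states; dividing by population gives a comparable rate.
overdoses["DeathsPer100k"] = overdoses["Deaths"] / overdoses["Population"] * 100_000

# The state with the most deaths and the state with the highest rate can differ.
highest_count = overdoses.loc[overdoses["Deaths"].idxmax(), "State"]
highest_rate = overdoses.loc[overdoses["DeathsPer100k"].idxmax(), "State"]

print(overdoses.sort_values("DeathsPer100k", ascending=False))
print("Highest raw count:", highest_count)
print("Highest per-capita rate:", highest_rate)
```

In this toy frame the largest state has the most deaths while a smaller state has the highest rate, which is the same kind of reversal described above.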
Every dataset has its imperfections. Let's clean ours up by making the state values consistent and converting our columns so that we can use them as integers.
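A cleanup of this kind might look like the following sketch. It assumes the state codes differ only in case and surrounding whitespace and that numeric columns arrive as strings with thousands separators; the column names and rows are hypothetical placeholders, and the notebook's own cleaning steps may differ.

```python
import pandas as pd

# Toy placeholder rows; the column names and string formats are assumptions.
raw = pd.DataFrame({
    "State": [" aa", "BB ", "bb"],
    "Population": ["1,200,000", "350,000", "350,000"],
    "Deaths": ["600", "1,500", "1,500"],
})

cleaned = raw.copy()

# Make state codes consistent: trim whitespace and upper-case them.
cleaned["State"] = cleaned["State"].str.strip().str.upper()

# Remove thousands separators so the numeric columns can be used as integers.
for col in ["Population", "Deaths"]:
    cleaned[col] = cleaned[col].str.replace(",", "", regex=False).astype(int)

print(cleaned.dtypes)
print(cleaned)
```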
You can check out the output in the notebook or in the image below. In this step we run several machine learning models to evaluate which is the most effective at predicting opioid prescribers. Though it is beyond the scope of this pattern, predicting these opioid prescribers lays the groundwork for predicting the likelihood that a certain type of doctor prescribes opioids. Additionally, if we had more years of data (beyond 2014), we could also predict future rates of overdoses. For now, we'll just take a look at the models.
After running various classifiers, we find that the Random Forest, Gradient Boosting, and Ensemble models performed best at predicting opioid prescribers.
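A comparison like this can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic (generated with `make_classification`), the hyperparameters are arbitrary, and the ensemble is assumed to be a soft-voting `VotingClassifier`, which may not match how the notebook builds its ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prescriber features and the binary
# "prescribes opioids" label; this is NOT the Kaggle data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
gb = GradientBoostingClassifier(random_state=0)

models = {
    "Logistic Regression": lr,
    "Random Forest": rf,
    "Gradient Boosting": gb,
    # Assumption: a soft-voting ensemble; the notebook may combine models differently.
    "Ensemble (voting)": VotingClassifier(
        estimators=[("lr", lr), ("rf", rf), ("gb", gb)], voting="soft"
    ),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Scoring on a held-out split keeps the comparison fair, since each model is evaluated on rows it did not see during training. The ranking on synthetic data says nothing about the ranking on the real dataset.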
Awesome job following along! Now go try and take this further or apply it to a different use case!
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.