CST 383: Introduction to Data Science

Course Description

In data science, data analysis and machine learning techniques are applied to visualize data, understand trends, and make predictions. In this course students will learn how to obtain data, preprocess it, apply machine learning methods, and visualize the results. A student who completes the course will have enough theoretical knowledge, and enough skill with modern statistical programming languages and their libraries, to define and perform complete data science projects.

Key Points

  • Learned to use popular Data Science libraries such as Numpy and Pandas
  • Applied machine learning models to predict outcomes

Final Project

Video of our group presenting the project.

For this project we used three different models: KNN, Decision Tree, and SVC. There was a total of 5 observations. In the first three observations we used all races in the dataset (‘asian/pacific islander’, ‘black’,’hispanic’, ‘white’, ‘other’). We noticed a low accuracy scores for these models. There was a slight improvement in accuracy score when we used the Decision Tree and SVC models but overall scores remained below 50%.

Based on the distribution of race on the dataset, we wanted to see how the models would predict if the subject was White or another race. This led us to conduct 3 additional observations. To do this we combined all the other races that are not Black and performed the same three models. Our accuracy scores improved from tuning our data and obtained scores higher than 50%. The Decision Tree model gave us the best accuracy score.

Apart from our models we also examined the effect of features. We noticed that increasing the number of predictor features did not improve our accuracy score.We also saw how increasing the training set size did not improve our test accuracy score. In addition, we saw cases of both overfitting and underfitting in our learning curves.

Our project’s objective was to see if we can predict the subject’s race based on two features, outcome and subject age. Based on our models, our results show that there is a better chance of predicting if the subject is either Black or another race.