  • Brian Chow

What's a Kaggle?

I know you have heard of a gaggle of geese, a pod of whales or a pack of wolves, but have you heard of a kaggle of data scientists? Well, now you have. Kaggle (recently purchased by Google) is an online portal for data science competitions and publicly available data sets. The premise is great: connect organizations that have real-world use cases with data science experts. This way the moon-shot approach to information technology can be democratized through prize competitions. It sounds very similar to the X Prizes used to spur commercial space flight innovation.


The Kaggle platform allows organizations to post a competition, sometimes with large prize money, built around a problem that might be solved better with an algorithmic solution. In many of the posted competitions the problem is well understood and algorithms are already deployed. So why do organizations turn to Kaggle to improve their existing solutions? It seems that the overall AI industry is still in the early stages, trying to figure out the standards and best practices needed to achieve mass adoption and more accurate results. So basically, it is very similar to the commercial space industry just before the X Prizes were created.


I entered a Kaggle competition back when I was doing an online course. I took part in the Titanic competition, which was very eye-opening. The competition lets you use any technology you want to solve a simple problem. It provides a sample data set, in this case a sample of the Titanic passenger manifest. This data set has several columns that provide name, age, gender and class (where on the ship the passenger stayed, first class etc.). Lastly, the data set records which passengers actually survived the Titanic disaster, and predicting this is the main goal of the competition.


Basically, you create an AI model that accurately predicts the survivors of the Titanic based on the columns provided or, at a minimum, in the spirit of data science, provides the probability of survival (e.g. 0.0234325). This competition is great and a perfect method for teaching analytics and forecasting. The funny thing about the overall competition, at the time I took it, was that the data set was dirty. Some fields had no data and others seemed to break rules regarding accurate data gathering. For example, children were counted as dependents and some key information was missing. So, like all data projects, the first steps were cleaning, grouping and, dare I say, making assumptions.
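To make the cleaning and grouping steps concrete, here is a minimal sketch in Python. The rows are invented for illustration and are not the actual Kaggle data. The field names, the choice of the median as a fill value, and the cutoff of 16 for a "child" are all assumptions, exactly the kind of assumptions a dirty data set forces you to make.

```python
from statistics import median

# Hypothetical rows, invented for illustration; not real passenger records.
passengers = [
    {"name": "A", "age": 29.0, "sex": "female", "pclass": 1},
    {"name": "B", "age": None, "sex": "male", "pclass": 3},  # age is missing
    {"name": "C", "age": 4.0, "sex": "male", "pclass": 2},
    {"name": "D", "age": 40.0, "sex": "female", "pclass": 3},
]


def fill_missing_ages(rows):
    """Return copies of the rows with missing ages set to the median known age."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fallback = median(known_ages)
    return [
        {**r, "age": fallback if r["age"] is None else r["age"]}
        for r in rows
    ]


def age_group(age, child_cutoff=16):
    """Bucket an age into 'child' or 'adult'; the cutoff is an assumption."""
    return "child" if age < child_cutoff else "adult"


# Cleaning: fill the gaps. Grouping: add a coarse age bucket to each row.
cleaned = fill_missing_ages(passengers)
for row in cleaned:
    row["age_group"] = age_group(row["age"])
```

Filling with the median is only one option. Dropping incomplete rows or filling by group (for example, the median age per class) are equally defensible, and each choice nudges the final predictions in a different direction.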


At first glance you could see possible predictors of survival. Of course, many women and children survived the Titanic disaster, most likely because of the old nautical philosophy "Women and Children First!" Another powerful predictor was ship class. First-class women and children had higher chances of survival, whereas the chances of single men in steerage class were much lower.
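One simple way to turn those observations into numbers is to compute a survival rate for each gender and class combination, then use that rate as a rough probability for any new passenger in the same group. The sketch below assumes a tiny invented sample (not the real manifest), and the function names and the default of 0.5 for unseen groups are hypothetical choices, not part of the competition.

```python
from collections import defaultdict

# Hypothetical sample, invented for illustration: (sex, class, survived)
sample = [
    ("female", 1, 1),
    ("female", 1, 1),
    ("female", 3, 1),
    ("female", 3, 0),
    ("male", 1, 1),
    ("male", 1, 0),
    ("male", 3, 0),
    ("male", 3, 0),
]


def survival_rates(rows):
    """Return the fraction of survivors for each (sex, class) group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for sex, pclass, survived in rows:
        totals[(sex, pclass)] += 1
        survivors[(sex, pclass)] += survived
    return {key: survivors[key] / totals[key] for key in totals}


def predict_probability(sex, pclass, rates, default=0.5):
    """Use the group's observed rate as a probability; fall back for unseen groups."""
    return rates.get((sex, pclass), default)


rates = survival_rates(sample)
first_class_woman = predict_probability("female", 1, rates)
steerage_man = predict_probability("male", 3, rates)
```

A group-rate lookup like this is a baseline rather than a real model, but it is a useful yardstick: anything more sophisticated should at least beat it.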


I didn't win first prize, and it was a very popular competition with about 70k participants. The people who did win earned kudos for their efforts and perhaps a job offer. What it shows is that Kaggle's method for AI innovation is powerful and is working to solve the world’s most complex data problems one competition at a time.


What does the future hold? I guess Kaggle could hold a competition to predict this. They could take all the past competitions and provide them as a data set to a new competition that states "looking for the quantitative probability that can accurately predict how many competitions will be accurately predicted." But seriously, this type of portal works very well, rewarding teams with money and fostering experiments with minimal up-front costs to organizations. The platform is so popular that some of AI’s biggest names have taken part in competitions and won. Wow. If you have aspirations to be a data scientist, then this is a great place to start.



©2020 by Bigfile. Proudly created with Wix.com