My essential data science tools

Over the last few years of working with data I’ve collected a toolbox of essential tools that I think all data scientists should know and use. These tools will not only help you be more efficient as a data scientist or data analyst, but will also help you work better within a team, be more organized and flexible, and produce analyses that others can reproduce more easily. [Read More]

Window Functions in Redshift

One of the coolest things I learned about in my Redshift journey has been Window Functions. Although Window Functions aren’t a novel feature and exist in other SQL databases, they are a really powerful tool to have in your analysis toolbelt and fit in really well with Redshift. Like the name suggests, Window Functions let you operate on a frame or ‘window’ of data and return a value for each row in that result set. [Read More]
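To make the "one value per row" idea concrete, here is a minimal sketch in plain Python that emulates a running total over a window. The `orders` rows and the `running_total` helper are made up for illustration and are not taken from the post; the SQL in the docstring is the kind of query this is meant to mimic.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (customer, order_day, amount)
orders = [
    ("alice", 1, 10),
    ("alice", 2, 25),
    ("bob", 1, 5),
    ("alice", 3, 5),
    ("bob", 2, 20),
]

def running_total(rows):
    """Roughly emulate:
    SUM(amount) OVER (PARTITION BY customer ORDER BY order_day
                      ROWS UNBOUNDED PRECEDING)
    """
    result = []
    # Sort by customer, then by day, so each partition is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        total = 0
        for customer, day, amount in partition:
            total += amount
            # Every input row is kept; it just gains one extra column.
            result.append((customer, day, amount, total))
    return result

rows_with_totals = running_total(orders)
for row in rows_with_totals:
    print(row)
```

Unlike a GROUP BY, nothing is collapsed here: five rows go in and five rows come out, each carrying the total for its own partition up to that point.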

A machine that generates money with pandas-datareader and Prophet

What is this? This isn’t really a money machine, I’m just kidding about that, sorry. This is just a quick exploration of two awesome Python packages that I’ve wanted to play with for a while: Prophet for time series forecasting, and pandas_datareader for grabbing historic stock price data. Prophet seems like an awesome project by Facebook to make state-of-the-art time series forecasting really easy and simple. I’ve been hoping to give it a try for a while. [Read More]

What is TF-IDF? The 10 minute guide

I recently started reading up a bit on tf-idf, which stands for term frequency-inverse document frequency. Tf-idf is a simple but surprisingly powerful technique that can be used to figure out what a document is ‘about’. It’s often used in the fields of information retrieval and text mining. Documents? First, let’s just define what I mean by document. For our purposes, a document can be thought of as all the words in a piece of text, broken down by how frequently each word appears in the text. [Read More]
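As a rough illustration of that definition, a document can be represented as a mapping from each word to its count. This is only a sketch: the `term_frequencies` helper, the sample sentence, and the simple regex tokenizer are assumptions for the example, not necessarily how the full post does it.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Split text into lowercase words and count how often each appears."""
    # Assumption: a word is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

doc = "the cat sat on the mat and the dog sat too"
counts = term_frequencies(doc)
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```

Note that the most frequent word here is "the", which says little about the content. That is the gap the inverse document frequency part of tf-idf is meant to address.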

A Redshift UDF to find AB test significance

I use Amazon’s Redshift every day. It’s an amazing database for data warehousing and analytics and allows you to analyze huge datasets in a blazingly efficient manner using SQL. The reason Redshift is so fast for analysis work is that, unlike many other SQL databases, it uses columnar storage and is highly optimized for distributing workloads across a cluster of instances. Redshift is based on PostgreSQL 8.0.2, so it’s pretty familiar to anyone who’s used Postgres or any other mainstream SQL dialect before. [Read More]