My Take on PyData Seattle 2017

PyData Seattle 2017 was held at Microsoft in Redmond. It was my first time attending a non-academic conference on data science. I have to say that I really enjoyed it, mostly because several talks were hands-on and provided Jupyter notebooks. Also, most of the presenters were such good public speakers!

Here I am going to blog about the talks I enjoyed the most (among the ones I attended). I am going to make use of the many tweets from the @PyDataSeattle feed:

PyData Seattle 2017 kicked off on the 7th of July at Microsoft in Redmond.

Highlights of the Conference

The most retweeted tweet of this conference was by far the keynote by Jake VanderPlas: “PyData 101”. He gave a smooth introduction to the origins of PyData and the many packages for data science that are available out there. Python was born as a scripting language, and now it is also a solid alternative to R.

Another highlight of the conference was the upcoming release of JupyterLab. JupyterLab aims to be an interactive and collaborative computational environment for Jupyter.

Next I am going to talk about the most interesting talks, which I have clustered into the following three categories: Data Science, Natural Language Processing (NLP), and Engineering.

Data Science

All the keynotes were really good. Microsoft being the biggest sponsor, one of the keynotes was about data science and AI at Microsoft. I really praise Microsoft for pushing AI into their stacks. Moreover, I like how Microsoft Cognitive Services are becoming a really powerful tool.

Jeffrey Heer discussed interactive data visualisation. Some of his visualisations reminded me a bit of the ones in Gapminder by Hans Rosling, who recently passed away.

In “D’oh! Unevenly spaced time series analysis of The Simpsons in Pandas”, Joe McCarthy showed through examples how to make the most out of Pandas. With Pandas you can wrangle data like you would in SQL, but with much more power, since you can run any further analysis directly in Python. I also got to learn about a repository of cool data sets; for example, it hosts a data set about the Simpsons.
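As a small taste of the kind of wrangling the talk covered, here is an unevenly spaced event log resampled onto a regular grid in Pandas (the data below is made up for illustration, not from the actual Simpsons data set):

```python
import pandas as pd

# Hypothetical log of events with irregular timestamps
events = pd.DataFrame(
    {"timestamp": pd.to_datetime(
        ["2017-01-01 09:00", "2017-01-01 09:07",
         "2017-01-01 09:45", "2017-01-01 11:30"]),
     "value": [1, 2, 3, 4]}
).set_index("timestamp")

# Resample the uneven series onto a regular hourly grid,
# summing the events that fall into each bucket
hourly = events["value"].resample("60min").sum()

# SQL-style GROUP BY: count events per hourly bucket
per_hour_count = events["value"].resample("60min").count()
```

Empty buckets (here, the 10:00 hour) show up explicitly after resampling, which is exactly what makes uneven series easier to reason about.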

Valentina Staneva presented a Jupyter notebook that does some amazing stuff with Python: e.g. you can remove people from a video with Non-negative Matrix Factorization (NMF).
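The intuition behind that trick can be sketched with a tiny NumPy-only NMF (multiplicative updates): stack each video frame as a column, fit a rank-1 factorization that captures the static background, and look at the residual, where anything moving stands out. The toy "video" below is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": each column is a flattened frame. A static background
# plus a transient "person" blob in a few frames.
background = rng.random(50)
frames = np.tile(background, (20, 1)).T  # 50 pixels x 20 frames
frames[10:15, 5:8] += 2.0                # moving foreground blob

# Rank-1 NMF via multiplicative updates: frames ≈ W @ H.
# The rank-1 part soaks up the static background.
k = 1
W = rng.random((frames.shape[0], k)) + 0.1
H = rng.random((k, frames.shape[1])) + 0.1
for _ in range(200):
    H *= (W.T @ frames) / (W.T @ W @ H + 1e-9)
    W *= (frames @ H.T) / (W @ (H @ H.T) + 1e-9)

reconstruction = W @ H                 # estimated background
foreground = frames - reconstruction   # the "person" stands out here
```

In the residual, the blob pixels are large while background pixels are near zero, which is the essence of removing people from a video with NMF.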

Another very actionable example of a data science application is the one provided by Jean-Rene Gauthier and Ben Van Dyke on Customer Lifetime Value (CLV). Their aim was to predict the number of purchases a given user will make in the following month. They also provided a Jupyter notebook.
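To make the setup concrete, here is only the simplest rate-based baseline on made-up transactions (their notebook uses proper probabilistic CLV models; this sketch just shows the shape of the problem):

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(
        ["2017-01-05", "2017-02-10", "2017-03-01",
         "2017-01-20", "2017-03-15", "2017-02-28"]),
})

observation_end = pd.Timestamp("2017-03-31")

# Per-customer summary: number of purchases and observation window
summary = tx.groupby("customer_id")["date"].agg(["count", "min", "max"])
summary["days_observed"] = (observation_end - summary["min"]).dt.days

# Naive forecast: historical purchases/day times a 30-day horizon
summary["expected_next_month"] = (
    summary["count"] / summary["days_observed"] * 30
)
```

A real CLV model would also account for customers going inactive, which is what the probabilistic approaches in the talk are for.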

Natural Language Processing (NLP)

I am discussing here a few applications related to text analysis. I learnt a new way of visualising the words that discriminate between two categories. ScatterText, presented by Jason Kessler, helps identify the words that are specific to a particular category, separating them from the words that are just frequent overall.
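The idea behind this kind of plot can be sketched without the ScatterText library itself: score each word by a smoothed frequency ratio between the two categories, so that words characteristic of one category rank high and words merely frequent in both rank near the middle. The toy corpora below are invented, not from the talk:

```python
from collections import Counter

# Toy corpora for two categories
cat_a = "economy tax economy jobs growth tax".split()
cat_b = "economy health care health schools".split()

freq_a, freq_b = Counter(cat_a), Counter(cat_b)
vocab = set(freq_a) | set(freq_b)

def score(word, a=freq_a, b=freq_b):
    # Laplace-smoothed probability ratio: > 1 means characteristic
    # of category A, < 1 of category B, near 1 frequent in both
    pa = (a[word] + 1) / (sum(a.values()) + len(vocab))
    pb = (b[word] + 1) / (sum(b.values()) + len(vocab))
    return pa / pb

ranked = sorted(vocab, key=score, reverse=True)
```

ScatterText refines this with better scoring and an interactive scatter plot, but the ranking above is the core of the idea.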

Various state-of-the-art deep learning models for getting embeddings out of text data were implemented and tested in Keras by Sujit Pal. An embedding is a compact representation of a document, which can be used to achieve higher classification accuracy than the standard bag-of-words approach. The notebook is available in the tweet.
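To make the notion of an embedding concrete without pulling in Keras, here is the simplest possible document embedding in NumPy: averaging word vectors into one dense, fixed-size vector per document. In the talk the word vectors are learned; here they are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(42)

docs = ["the cat sat", "the dog ran", "cat and dog"]
vocab = sorted({w for d in docs for w in d.split()})
dim = 8  # embedding dimension, arbitrary for this sketch

# Toy lookup table of word vectors (learned by the model in practice)
word_vecs = {w: rng.normal(size=dim) for w in vocab}

def embed(doc):
    # Document embedding = mean of its word vectors: dense and
    # fixed-size, unlike a sparse bag-of-words vector
    return np.mean([word_vecs[w] for w in doc.split()], axis=0)

embeddings = np.stack([embed(d) for d in docs])
```

Every document maps to the same small dense dimension, which is what lets a downstream classifier beat the sparse bag-of-words baseline.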

A few talks were about scraping the web to collect text data: “Python Web Scraping” by Lingqiang Kong and “High Fidelity Web Crawling in Python” by Josh Weissbock. I also tested Python's capabilities for scraping the web: if most of your data analysis pipeline is in Python, why not also scrape the web with Python?
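For a dependency-free taste of scraping in Python, the standard library's html.parser can already extract links from a page. The HTML below is an inline stand-in for what urllib.request would fetch from a real URL:

```python
from html.parser import HTMLParser

# Stand-in for: urllib.request.urlopen(url).read().decode()
html = ('<html><body><a href="/talks">Talks</a> '
        '<a href="/sched">Schedule</a></body></html>')

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(html)
```

Libraries like requests and BeautifulSoup make this more pleasant, but the whole loop of fetch, parse, and follow links fits comfortably in a Python pipeline.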


Engineering

Other talks focused on more engineering-related data science issues. For example, Katrina Rieh presented the engineering challenges of getting data science going at her company.

I have started to appreciate Visual Studio since joining Microsoft. Quite recently, Visual Studio gained support for Python and R. This is very handy because it is now possible to use its debugging tools for both languages. Moreover, it also allows mixed projects, for example in C, C#, and Python, and lets you debug them all together.

Finally, I got to know that you can scale up your Python computations using Dask.
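A minimal sketch of the Dask idea (assuming Dask is installed): build a chunked array lazily, and nothing runs until you ask for the result, at which point Dask schedules the chunks in parallel:

```python
import dask.array as da

# A 1000x1000 array split into 250x250 chunks; created lazily,
# so no memory is allocated for the full array yet
x = da.ones((1000, 1000), chunks=(250, 250))

# Operations build a task graph; .compute() executes it,
# processing the chunks (possibly in parallel, possibly out of core)
total = x.sum().compute()
```

The same chunked, lazy model extends to dask.dataframe for Pandas-style work that does not fit in memory.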

There was also a nice overview of how to use Apache Spark and TensorFrames for deep learning on the Databricks infrastructure.

I am sure I missed something, so feel free to comment if I failed to mention a talk you liked.