Analyzing PyPI package download statistics and building a data application with dstack


Even though Jupyter notebooks are indispensable for writing code and interacting with data, there is still a need to build applications for non-technical audiences. Fortunately, Python offers a wide and vibrant open-source ecosystem not only for data analysis but also for building data applications on top of its data and visualization libraries.

In this article, I'll give you an overview of how to build an interactive data application with dstack. Along the way, we'll learn how to analyze and visualize data with Python, pandas, and plotly, and how to query data from Google's BigQuery using Python and pandas.

For the sake of education, we will analyze the download statistics for Python packages and make our application visualize the following:

  • The most downloaded packages
  • The number of downloads over the last one, six, and twelve months
  • The number of downloads per package version, operating system, and Python version

We'll build an interactive application that visualizes these statistics in the form of a dashboard using dstack. The application built in this tutorial can be found here.

This tutorial may be interesting to beginner data scientists who aspire to learn Python in the context of data analysis.

Note: the analysis in this notebook serves educational purposes only. If you find a mistake or have a question related to the code or the results, please drop an email to team@dstack.ai.

This tutorial was originally inspired by pepy.tech.

On the Python package download statistics dataset

Why use Python package download statistics for this tutorial? First, because the data is publicly available. Second, because PyPI offers convenient means to access this data. Third, because an application visualizing this data makes an excellent example of building data applications with Python.

Where does the data for this analysis come from? The data on Python package downloads is stored by PyPI in Google's BigQuery. The downloads for each day are stored in a separate table that matches the pattern the-psf:pypi.downloadsYYYYMMDD. All of these tables share the same schema; the fields used in this tutorial include timestamp, file.project, file.version, details.python, and details.system.name.

Note that this data in BigQuery can only be accessed with your own Google Cloud account. Querying it is billed according to BigQuery's pricing, which is $5.00 per TB of data scanned. Given the volume of the download data, querying may be very expensive: the more tables you query, the more data is scanned, and the more you pay.

Passing Google Cloud credentials

To connect to Google's BigQuery, we'll have to authenticate. BigQuery, being part of Google Cloud, requires Google credentials passed via OAuth 2.0.

There are many ways to connect to BigQuery from Python. In this tutorial, we will use pandas together with pandas_gbq, which is the easiest way of working with BigQuery.

For authentication, pandas_gbq integrates the google-auth library.
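
Both libraries are published on PyPI and, if needed, can be installed via pip:

pip install pandas-gbq google-auth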

To start working with pandas_gbq, you have to pass credentials and the name of the project the credentials are associated with.

In this tutorial, we will use Service Account credentials. Credentials of this type must be created through the Google Cloud Console. Once created, the corresponding JSON file must be downloaded so it can be used from this notebook.

Note: the JSON file with credentials must only be stored in a secure environment.

We'll use the following code to load credentials from Python:

from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file("google-credentials.json")

Now, we must initialize the context of pandas_gbq with the acquired credentials and the name of the project:

import pandas_gbq

pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "project-name"  # Important: use the name of your own Google Cloud project

Note: before using pandas_gbq, make sure you've enabled the BigQuery API for this project and that the service account associated with the acquired credentials has permissions to access BigQuery and run BigQuery jobs.
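
If you prefer the command line, the BigQuery API can be enabled with the gcloud CLI (assuming project-name is your own project, as above):

gcloud services enable bigquery.googleapis.com --project=project-name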

Now, we're all set to query the PyPI package download statistics data from Python.

On the application that we'll build

In this example, we are going to use dstack to build an application that shows the downloads of individual projects grouped by project version, Python version, and operating system. To keep the tutorial simple, we'll limit it to showing only the data for the top 100 most downloaded projects.

Why dstack? First, building and deploying a dstack application takes far less time than with other frameworks. Secondly, dstack doesn't require you to do web programming, write UI scripts, or do the other things that are normally needed to develop and deploy applications. Instead of programming an application, you push the data to a dstack server together with declarative logic describing how it should react to user input. At first this approach may surprise you, but you'll soon see how simple yet powerful it is.

How is the application going to work? In our application, we'll query the data using pandas_gbq, make visualizations using plotly, and push the resulting visualizations to dstack, each associated with parameters such as:

  • The name of the project
  • The period of time (e.g. Past 30 days, Past 6 months, Past 12 months)
  • The dimension of the data (e.g. grouped by project version, Python version, or operating system)

Once we push the visualizations with all parameters, dstack will automatically compile them into a dashboard. The application built in this tutorial can be found here.

Most of this tutorial will be a repetition of three simple steps:

  1. Running a SQL query that aggregates downloads over a period X, grouped by project name, Y, and Z, where X/Y may be 30 days/day, 6 months/week, or 12 months/month, and Z may be project version, Python version, or operating system
  2. Making a plot
  3. Pushing the plot to dstack with the corresponding parameters, such as the project name, X, and Z

Disclaimer

Please note that some of the queries below scan BigQuery data spanning six or more months. Avoid running the whole notebook unless necessary. Read the Optimization section below on how to avoid running unnecessary queries against BigQuery.

Getting the top 100 most downloaded PyPI projects

In this tutorial, we focus only on the top 100 most downloaded projects. The first thing we'll do is get the names of these 100 projects. For that, we'll aggregate the downloads per project over the last 30 days, sort them in descending order, and take the first 100.

Here's how we'll do it with pandas_gbq:

top_df = pandas_gbq.read_gbq("""
    SELECT file.project, COUNT(*) AS downloads
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
    GROUP BY file.project
    ORDER BY downloads DESC
    LIMIT 100
""")

As you see, the result of the pandas_gbq.read_gbq function is a pandas.DataFrame.

The most important parts of the query are the table name, the-psf.pypi.downloads*, and the WHERE clause that uses _TABLE_SUFFIX. What does it do? It scans only the tables whose names start with the-psf.pypi.downloads and end with a date within the last 30 days. This way, we ensure that the query scans only the data for the last 30 days. Why is this important? Because BigQuery charges per TB of scanned data, so limiting the amount of data a query scans is the main way to control its cost. Later in the post, we'll get back to the pricing factor and look at other ways to optimize it.
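
A practical way to check the cost before paying it: the google-cloud-bigquery client library (a lower-level companion to pandas_gbq) supports dry runs that report how much data a query would scan without actually running it. Here's a minimal sketch, reusing the credentials loaded earlier:

from google.cloud import bigquery

client = bigquery.Client(credentials=credentials, project="project-name")
# A dry run is free: it only estimates the amount of data the query would scan.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT COUNT(*) FROM `the-psf.pypi.downloads*`", job_config=job_config)
print("This query would scan %.2f GB" % (job.total_bytes_processed / 1e9))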

Meanwhile, let's declare a variable top_projects and assign it the list of names of the top 100 most downloaded projects:

top_projects = top_df["project"].tolist()
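
A quick sanity check that we got what we asked for:

print(top_df.head())      # two columns: project, downloads
print(len(top_projects))  # expected: 100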

Getting downloads per project version for the past 30 days

Here's the first SQL query:

import pandas as pd

df_per_system_1mv = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 8) AS day, file.version
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, version
""")
df_per_system_1mv["day"] = pd.to_datetime(df_per_system_1mv["day"], format='%Y%m%d')

Now that we have a dataframe with the data we need, we can plot the downloads for each version of the project. The five latest versions get their own traces; all remaining versions are aggregated into a single trace named Other. Here's an example for the project pandas:

from packaging import version
import plotly.express as px
import plotly.graph_objects as go

df = df_per_system_1mv[df_per_system_1mv["project"] == 'pandas']
# The five latest versions get their own traces; the rest is aggregated into "Other".
top_versions = [str(x) for x in sorted(df['version'].apply(version.parse).unique(), reverse = True)[:5]]

fig = go.Figure()
# The loop variable is named "ver" so it doesn't shadow the packaging.version module.
for i, ver in enumerate(top_versions):
    df_i = df[df["version"] == ver]
    fig.add_trace(go.Scatter(x=df_i["day"], y=df_i["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[i]), stackgroup='one', name = ver))

df_other = df[~df["version"].isin(top_versions)]
df_other_aggregated = df_other.groupby(['project', 'day'])['downloads'].sum().reset_index()
fig.add_trace(go.Scatter(x=df_other_aggregated["day"], y=df_other_aggregated["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[len(top_versions)]), stackgroup='one', name = 'Other'))
fig.update_traces(mode='lines+markers')

[Figure: downloads per project version for pandas, past 30 days (pypi_1mv.png)]

Getting downloads per Python version for the past 30 days

Here's almost the same query as above, but this time it groups the data by the major Python version (REGEXP_EXTRACT(details.python, r'^\d*')):

df_per_system_1mp = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 8) AS day,
           REGEXP_EXTRACT(details.python, r'^\\d*') AS python
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, python
    ORDER BY day
""")
df_per_system_1mp["day"] = pd.to_datetime(df_per_system_1mp["day"], format='%Y%m%d')

A sample plot for the project pandas:

df = df_per_system_1mp[df_per_system_1mp["project"] == 'pandas']
df_2 = df[df["python"] == "2"]
df_3 = df[df["python"] == "3"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_3["day"], y=df_3["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Python 3"))
fig.add_trace(go.Scatter(x=df_2["day"], y=df_2["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Python 2"))
fig.update_traces(mode='lines+markers')

[Figure: downloads per Python version for pandas, past 30 days (pypi_1mp.png)]

Getting downloads per operating system for the past 30 days

Here's another query, now grouping by operating system:

df_per_system_1m = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 8) AS day,
           CASE details.system.name
               WHEN 'Darwin' THEN 'Darwin'
               WHEN 'Linux' THEN 'Linux'
               WHEN 'Windows' THEN 'Windows'
               ELSE 'Other'
           END AS system
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, system
    ORDER BY day
""")
df_per_system_1m["day"] = pd.to_datetime(df_per_system_1m["day"], format='%Y%m%d')

A sample plot for the project pandas:

df = df_per_system_1m[df_per_system_1m["project"] == 'pandas']
df_linux = df[df["system"] == "Linux"]
df_windows = df[df["system"] == "Windows"]
df_darwin = df[df["system"] == "Darwin"]
df_other = df[df["system"] == "Other"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_linux["day"], y=df_linux["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Linux"))
fig.add_trace(go.Scatter(x=df_darwin["day"], y=df_darwin["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Darwin"))
fig.add_trace(go.Scatter(x=df_windows["day"], y=df_windows["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[6]), stackgroup='one', name = "Windows"))
fig.add_trace(go.Scatter(x=df_other["day"], y=df_other["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[9]), stackgroup='one', name = "Other"))
fig.update_traces(mode='lines+markers')

[Figure: downloads per operating system for pandas, past 30 days (pypi_1m.png)]

Getting downloads per project version for the past 6 months

Over the following cells, we'll repeat the steps above, but for different values of X, Y, and Z.

df_per_system_6mv = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, TIMESTAMP_TRUNC(timestamp, WEEK) AS day, file.version
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, version
    ORDER BY day
""")
df_per_system_6mv["day"] = pd.to_datetime(df_per_system_6mv["day"], format='%Y%m%d')

A sample plot for the project pandas:

from packaging import version

df = df_per_system_6mv[df_per_system_6mv["project"] == 'pandas']
# This time, take the ten latest versions.
top_versions = [str(x) for x in sorted(df['version'].apply(version.parse).unique(), reverse = True)[:10]]

fig = go.Figure()
for i, ver in enumerate(top_versions):
    df_i = df[df["version"] == ver]
    fig.add_trace(go.Scatter(x=df_i["day"], y=df_i["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[i]), stackgroup='one', name = ver))

df_other = df[~df["version"].isin(top_versions)]
df_other_aggregated = df_other.groupby(['project', 'day'])['downloads'].sum().reset_index()
fig.add_trace(go.Scatter(x=df_other_aggregated["day"], y=df_other_aggregated["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[len(top_versions)]), stackgroup='one', name = 'Other'))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df_per_system_6mv["day"].iloc[0], df_per_system_6mv["day"].iloc[-1]])

[Figure: downloads per project version for pandas, past 6 months (pypi_6mv.png)]

Getting downloads per Python version for the past 6 months

df_per_system_6mp = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, TIMESTAMP_TRUNC(timestamp, WEEK) AS day,
           REGEXP_EXTRACT(details.python, r'^\\d*') AS python
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, python
    ORDER BY day
""")
df_per_system_6mp["day"] = pd.to_datetime(df_per_system_6mp["day"], format='%Y%m%d')

A sample plot for the project pandas:

from dateutil.relativedelta import relativedelta

df = df_per_system_6mp[df_per_system_6mp["project"] == 'pandas']
df_2 = df[df["python"] == "2"]
df_3 = df[df["python"] == "3"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_3["day"], y=df_3["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Python 3"))
fig.add_trace(go.Scatter(x=df_2["day"], y=df_2["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Python 2"))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df_per_system_6mp["day"].iloc[0] + relativedelta(weeks=1), df_per_system_6mp["day"].iloc[-1] - relativedelta(weeks=1)])
fig.show()

[Figure: downloads per Python version for pandas, past 6 months (pypi_6mp.png)]

The most interesting part of this code is the call to fig.update_xaxes. The first and last weeks of the period may contain incomplete data, so we shift the plot's boundaries one week to the right and one week to the left correspondingly to exclude them.

Getting downloads per operating system for the past 6 months

df_per_system_6m = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, TIMESTAMP_TRUNC(timestamp, WEEK) AS day,
           CASE details.system.name
               WHEN 'Darwin' THEN 'Darwin'
               WHEN 'Linux' THEN 'Linux'
               WHEN 'Windows' THEN 'Windows'
               ELSE 'Other'
           END AS system
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, system
    ORDER BY day
""")
df_per_system_6m["day"] = pd.to_datetime(df_per_system_6m["day"], format='%Y%m%d')

A sample plot for the project pandas:

df = df_per_system_6m[df_per_system_6m["project"] == 'pandas']
df_linux = df[df["system"] == "Linux"]
df_windows = df[df["system"] == "Windows"]
df_darwin = df[df["system"] == "Darwin"]
df_other = df[df["system"] == "Other"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_linux["day"], y=df_linux["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Linux"))
fig.add_trace(go.Scatter(x=df_darwin["day"], y=df_darwin["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Darwin"))
fig.add_trace(go.Scatter(x=df_windows["day"], y=df_windows["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[6]), stackgroup='one', name = "Windows"))
fig.add_trace(go.Scatter(x=df_other["day"], y=df_other["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[9]), stackgroup='one', name = "Other"))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df_per_system_6m["day"].iloc[0] + relativedelta(weeks=1), df_per_system_6m["day"].iloc[-1] - relativedelta(weeks=1)])

[Figure: downloads per operating system for pandas, past 6 months (pypi_6m.png)]

Getting downloads per project version for the past 12 months

df_per_system_12mv = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 6) AS `day`, file.version
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, version
    ORDER BY day
""")
df_per_system_12mv["day"] = pd.to_datetime(df_per_system_12mv["day"], format='%Y%m')

A sample plot for the project pandas:

from packaging import version

df = df_per_system_12mv[df_per_system_12mv["project"] == 'pandas']
top_versions = [str(x) for x in sorted(df['version'].apply(version.parse).unique(), reverse = True)[:10]]

fig = go.Figure()
for i, ver in enumerate(top_versions):
    df_i = df[df["version"] == ver]
    fig.add_trace(go.Scatter(x=df_i["day"], y=df_i["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[i]), stackgroup='one', name = ver))

df_other = df[~df["version"].isin(top_versions)]
df_other_aggregated = df_other.groupby(['project', 'day'])['downloads'].sum().reset_index()
fig.add_trace(go.Scatter(x=df_other_aggregated["day"], y=df_other_aggregated["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[len(top_versions)]), stackgroup='one', name = 'Other'))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df_per_system_12mv["day"].iloc[0] + relativedelta(months=1), df_per_system_12mv["day"].iloc[-1] - relativedelta(months=1)])

[Figure: downloads per project version for pandas, past 12 months (pypi_12mv.png)]

Getting downloads per Python version for the past 12 months

df_per_system_12mp = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 6) AS `day`,
           REGEXP_EXTRACT(details.python, r'^\\d*') AS python
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, python
    ORDER BY day
""")
df_per_system_12mp["day"] = pd.to_datetime(df_per_system_12mp["day"], format='%Y%m')
df = df_per_system_12mp[df_per_system_12mp["project"] == 'pandas']
df_2 = df[df["python"] == "2"]
df_3 = df[df["python"] == "3"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_3["day"], y=df_3["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Python 3"))
fig.add_trace(go.Scatter(x=df_2["day"], y=df_2["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Python 2"))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df_per_system_12mp["day"].iloc[0] + relativedelta(months=1), df_per_system_12mp["day"].iloc[-1] - relativedelta(months=1)])

[Figure: downloads per Python version for pandas, past 12 months (pypi_12mp.png)]

Getting downloads per operating system for the past 12 months

df_per_system_12m = pandas_gbq.read_gbq(f"""
    SELECT file.project, COUNT(*) AS downloads, SUBSTR(_TABLE_SUFFIX, 1, 6) AS `day`,
           CASE details.system.name
               WHEN 'Darwin' THEN 'Darwin'
               WHEN 'Linux' THEN 'Linux'
               WHEN 'Windows' THEN 'Windows'
               ELSE 'Other'
           END AS system
    FROM `the-psf.pypi.downloads*`
    WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m01', DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH))
                            AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
      AND file.project IN UNNEST({top_projects})
    GROUP BY file.project, day, system
    ORDER BY day
""")
df_per_system_12m["day"] = pd.to_datetime(df_per_system_12m["day"], format='%Y%m')
df = df_per_system_12m[df_per_system_12m["project"] == 'pandas']
df_linux = df[df["system"] == "Linux"]
df_windows = df[df["system"] == "Windows"]
df_darwin = df[df["system"] == "Darwin"]
df_other = df[df["system"] == "Other"]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_linux["day"], y=df_linux["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Linux"))
fig.add_trace(go.Scatter(x=df_darwin["day"], y=df_darwin["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Darwin"))
fig.add_trace(go.Scatter(x=df_windows["day"], y=df_windows["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[6]), stackgroup='one', name = "Windows"))
fig.add_trace(go.Scatter(x=df_other["day"], y=df_other["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[9]), stackgroup='one', name = "Other"))
fig.update_traces(mode='lines+markers')
fig.update_xaxes(range=[df["day"].iloc[0], df_per_system_12m["day"].iloc[-1] - relativedelta(months=1)])

[Figure: downloads per operating system for pandas, past 12 months (pypi_12m.png)]

Setting up dstack

Now that we have the data and have seen the visualizations, we're ready to set up a dstack server and build the application.

dstack is an open-source framework that can be installed via pip:

pip install dstack

Or using conda:

conda install -c dstackai dstack

dstack has two components:

  1. A server that hosts data and runs published applications
  2. A client Python package to push data to a server

A dstack server can run locally or on a dedicated machine. To start it locally, run this command:

dstack server start --port 8081

Once the server is started, you can use the Python client package to push data to it.

Optimization

Now that we've set up a dstack server, here is an important note on how to work with BigQuery and similar services that charge per amount of data scanned. The best way to work with such services is to avoid re-running unnecessary queries, which may be very costly. This is easily done by caching the data locally on disk or in some other way, and it is especially important if you have to update visualizations on a regular schedule.
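
For example, here's a minimal sketch of a local-disk cache around pandas_gbq (it assumes the pyarrow package is installed for Parquet support; cached_query and the file name are hypothetical helpers, not part of any library):

from pathlib import Path

def cached_query(sql, cache_file):
    # Return the locally cached result if it exists; otherwise query BigQuery and cache it.
    path = Path(cache_file)
    if path.exists():
        return pd.read_parquet(path)
    df = pandas_gbq.read_gbq(sql)
    df.to_parquet(path)
    return df

# top_df = cached_query("SELECT ...", "top_projects.parquet")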

Conveniently, you can also use dstack itself for storing cached data. This can be done the following way:

from dstack import create_frame, push

push("pypi/top-projects", top_df)

frame = create_frame("pypi/raw-data")

for df_period, period_name in [(df_per_system_1mv, 'Past 30 days'), (df_per_system_6mv, 'Past 6 months'), (df_per_system_12mv, 'Past 12 months')]:
    frame.commit(df_period, params = {'Period': period_name, 'Group by': 'Version'})

for df_period, period_name in [(df_per_system_1mp, 'Past 30 days'), (df_per_system_6mp, 'Past 6 months'), (df_per_system_12mp, 'Past 12 months')]:
    frame.commit(df_period, params = {'Period': period_name, 'Group by': 'Python'})

for df_period, period_name in [(df_per_system_1m, 'Past 30 days'), (df_per_system_6m, 'Past 6 months'), (df_per_system_12m, 'Past 12 months')]:
    frame.commit(df_period, params = {'Period': period_name, 'Group by': 'System'})

frame.push()

This code puts the data into two "stacks": pypi/top-projects and pypi/raw-data. A "stack" is a dstack concept: it is versioned (meaning it stores all revisions when updated), and it can contain one or multiple artifacts associated with parameters. Such artifacts can include a dataframe, an ML model, or a visualization. In the code above, we use stacks to store data frames. Once an artifact is published, it can be accessed through dstack's web interface or pulled via Python code. Below is the code that does that:

from dstack import pull

top_df = pull("pypi/top-projects")

df_per_system_1mv = pull("pypi/raw-data", params={"Period": "Past 30 days", "Group by": "Version"})
df_per_system_1mp = pull("pypi/raw-data", params={"Period": "Past 30 days", "Group by": "Python"})
df_per_system_1m = pull("pypi/raw-data", params={"Period": "Past 30 days", "Group by": "System"})

df_per_system_6mv = pull("pypi/raw-data", params={"Period": "Past 6 months", "Group by": "Version"})
df_per_system_6mp = pull("pypi/raw-data", params={"Period": "Past 6 months", "Group by": "Python"})
df_per_system_6m = pull("pypi/raw-data", params={"Period": "Past 6 months", "Group by": "System"})

df_per_system_12mv = pull("pypi/raw-data", params={"Period": "Past 12 months", "Group by": "Version"})
df_per_system_12mp = pull("pypi/raw-data", params={"Period": "Past 12 months", "Group by": "Python"})
df_per_system_12m = pull("pypi/raw-data", params={"Period": "Past 12 months", "Group by": "System"})

As a result, the variables are assigned the actual data frames we stored earlier. The pulled data frames have exactly the same schema as when they were pushed. This approach is especially useful when the processing is done on a schedule in an automated manner.
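
Putting the two ideas together, one can sketch a pull_or_query helper that reads a cached frame from dstack and falls back to BigQuery only when the data isn't there yet. This sketch rests on one assumption worth verifying against dstack's documentation: that pull raises an exception when the requested stack or attachment doesn't exist.

def pull_or_query(stack, sql, params=None):
    # Try the dstack cache first; fall back to BigQuery and cache the result.
    try:
        return pull(stack, params=params)
    except Exception:  # assumption: pull fails for a missing stack
        df = pandas_gbq.read_gbq(sql)
        frame = create_frame(stack)
        frame.commit(df, params=params)
        frame.push()
        return df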

Making a data application using dstack

Now that we've also covered how to set up a dstack server and how to cache data there, let's put everything together and finally build the application. You'll be surprised by how little is left to do to build it from the ground up.

Here's the code that builds the entire application:

frame = create_frame("pypi/downloads")

from packaging import version

for df_period, period_name, period_step in [(df_per_system_1mv, 'Past 30 days', relativedelta(days=1)), (df_per_system_6mv, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12mv, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        df = df_period[df_period["project"] == project]
        # The ten latest versions get their own traces; the rest goes into "Other".
        top_versions = [str(x) for x in sorted(df['version'].apply(version.parse).unique(), reverse = True)[:10]]

        fig = go.Figure()
        for i, ver in enumerate(top_versions):
            df_i = df[df["version"] == ver]
            fig.add_trace(go.Scatter(x=df_i["day"], y=df_i["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[i]), stackgroup='one', name = ver))

        df_other = df[~df["version"].isin(top_versions)]
        df_other_aggregated = df_other.groupby(['project', 'day'])['downloads'].sum().reset_index()
        fig.add_trace(go.Scatter(x=df_other_aggregated["day"], y=df_other_aggregated["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[len(top_versions)]), stackgroup='one', name = 'Other'))
        fig.update_traces(mode='lines+markers')
        fig.update_xaxes(range=[df_period["day"].iloc[0] + period_step, df_period["day"].iloc[-1] - period_step])
        frame.commit(fig, params = {"Project": project, 'Group by': 'Version', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'Version', 'Period': period_name}))

for df_period, period_name, period_step in [(df_per_system_1mp, 'Past 30 days', relativedelta(days=1)), (df_per_system_6mp, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12mp, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        df = df_period[df_period["project"] == project]
        df_2 = df[df["python"] == "2"]
        df_3 = df[df["python"] == "3"]

        fig = go.Figure()
        fig.add_trace(go.Scatter(x=df_3["day"], y=df_3["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Python 3"))
        fig.add_trace(go.Scatter(x=df_2["day"], y=df_2["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Python 2"))
        fig.update_traces(mode='lines+markers')
        fig.update_xaxes(range=[df["day"].iloc[0] + period_step, df["day"].iloc[-1] - period_step])
        frame.commit(fig, params = {"Project": project, 'Group by': 'Python', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'Python', 'Period': period_name}))

for df_period, period_name, period_step in [(df_per_system_1m, 'Past 30 days', relativedelta(days=1)), (df_per_system_6m, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12m, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        df = df_period[df_period["project"] == project]
        df_linux = df[df["system"] == "Linux"]
        df_windows = df[df["system"] == "Windows"]
        df_darwin = df[df["system"] == "Darwin"]
        df_other = df[df["system"] == "Other"]

        fig = go.Figure()
        fig.add_trace(go.Scatter(x=df_linux["day"], y=df_linux["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Linux"))
        fig.add_trace(go.Scatter(x=df_darwin["day"], y=df_darwin["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Darwin"))
        fig.add_trace(go.Scatter(x=df_windows["day"], y=df_windows["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[6]), stackgroup='one', name = "Windows"))
        fig.add_trace(go.Scatter(x=df_other["day"], y=df_other["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[9]), stackgroup='one', name = "Other"))
        fig.update_traces(mode='lines+markers')
        fig.update_xaxes(range=[df["day"].iloc[0] + period_step, df["day"].iloc[-1] - period_step])
        frame.commit(fig, params = {"Project": project, 'Group by': 'System', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'System', 'Period': period_name}))

frame.push()

What we do in this code is:

  1. Create a frame (via dstack.create_frame)
  2. For every visualization, commit an attachment to the frame containing the figure and its associated parameters, such as "Project", "Period", and "Group by" (via dstack.StackFrame.commit)
  3. Push the frame (via dstack.StackFrame.push)

This code can be significantly simplified if the plotting code is moved into separate functions:

frame = create_frame("pypi/downloads")

for df_period, period_name, period_step in [(df_per_system_1mv, 'Past 30 days', relativedelta(days=1)), (df_per_system_6mv, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12mv, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        fig = plot_project_versions(project, df_period, period_name, period_step)
        frame.commit(fig, params = {"Project": project, 'Group by': 'Version', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'Version', 'Period': period_name}))

for df_period, period_name, period_step in [(df_per_system_1mp, 'Past 30 days', relativedelta(days=1)), (df_per_system_6mp, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12mp, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        fig = plot_python_versions(project, df_period, period_name, period_step)
        frame.commit(fig, params = {"Project": project, 'Group by': 'Python', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'Python', 'Period': period_name}))

for df_period, period_name, period_step in [(df_per_system_1m, 'Past 30 days', relativedelta(days=1)), (df_per_system_6m, 'Past 6 months', relativedelta(weeks=1)), (df_per_system_12m, 'Past 12 months', relativedelta(months=1))]:
    for project in top_projects:
        fig = plot_systems(project, df_period, period_name, period_step)
        frame.commit(fig, params = {"Project": project, 'Group by': 'System', 'Period': period_name})
        print(str({"Project": project, 'Group by': 'System', 'Period': period_name}))

frame.push()
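
The functions plot_project_versions, plot_python_versions, and plot_systems are simply the plotting code from the earlier cells wrapped into functions. As an illustration, here's what plot_python_versions could look like, extracted from the per-Python-version block above (the period_name argument is kept only for a uniform signature across the three functions):

def plot_python_versions(project, df_period, period_name, period_step):
    # Stacked chart of downloads per major Python version for a single project.
    df = df_period[df_period["project"] == project]
    df_2 = df[df["python"] == "2"]
    df_3 = df[df["python"] == "3"]

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df_3["day"], y=df_3["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[0]), stackgroup='one', name = "Python 3"))
    fig.add_trace(go.Scatter(x=df_2["day"], y=df_2["downloads"], line_shape='spline', line=dict(width = 1, color=px.colors.cyclical.mygbm[3]), stackgroup='one', name = "Python 2"))
    fig.update_traces(mode='lines+markers')
    # Trim the first and last period to avoid showing incomplete data.
    fig.update_xaxes(range=[df["day"].iloc[0] + period_step, df["day"].iloc[-1] - period_step])
    return fig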

As you can see, the code related to the application itself is even shorter than the code dedicated to making the plots.

What happens after you push this data to a dstack server? The server builds an interactive form on the fly, where you can change the values of Project, Period, and Group by and see the visualization corresponding to the selected values. Since the data is pre-calculated, it can be served from a cache and thus scales efficiently under high load (e.g. if the application is used by a large team).

[Screenshot: the resulting interactive dashboard (Screenshot 2020-08-27 at 12.44.54.png)]

Scheduling regular jobs

As you saw above, a dstack application can be a static report that is updated manually or automatically on a regular schedule.

Automation requires setting up regular jobs that run the code above. While the queries for the 6- and 12-month data can be scheduled on a monthly basis, the 30-day queries may run daily or weekly. This way you'll optimize the costs while keeping the data up-to-date.
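
If you run these updates as plain Python scripts, ordinary cron is enough for the scheduling. Here's a sketch, where update_monthly.py and update_daily.py are hypothetical scripts containing the corresponding queries and pushes:

# m h dom mon dow  command
0 3 1 * *  python /opt/pypi-stats/update_monthly.py   # 6- and 12-month data, monthly
0 3 * * *  python /opt/pypi-stats/update_daily.py     # 30-day data, daily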

Alternatively, the Docker version of dstack as well as the in-cloud version offer built-in Jobs, which let you schedule stack updates at the needed cadence.

That's it for now. I hope this article helps you learn how to build data applications with Python and dstack. Thank you for finding the time to read it.

dstack is a very new tool that currently offers quite basic functionality; its team plans to bring more advanced features over the following months. You're welcome to check out their public roadmap, submit feature requests to their issue tracker, and, of course, share your feedback in their Discord chat.

