Publish, track and share data analysis results without development skills


Introducing dstack.ai, a publication service with Python and R libraries

In the past few years, many companies have invested significantly in automating canned data reports that focus on operational metrics, and in collaborating on them. However, in my experience, there is still a lot of room for improvement when it comes to collaboration around ad-hoc exploratory analysis. This is especially true for teams of data scientists, who constantly face new challenges such as prototyping ML models and therefore cannot rely on traditional BI tools.

As a result, many teams still rely on the cumbersome process of using local network drives or emails to publish, track, and share data reports with their peers or clients. Data scientists also resort to sharing Jupyter notebooks with their peers, but notebooks are still a poor way to present results, especially to non-technical audiences.

A growing community has started using solutions such as Dash and Shiny to collaborate on data reports, especially data visualizations and exploratory data analysis, which is actually pretty cool. Unfortunately, using these apps requires extra skills or resources: expertise in HTML, CSS, and JS to render the apps, an understanding of client-server architecture to publish reports, and even additional licenses to access operational reports such as dashboards. That isn't a great experience for data scientists.

Others with substantial development budgets are increasingly building their own custom applications. But as each business problem is unique, each proprietary solution comes with its own costs, such as supporting new libraries for new kinds of use cases, maintaining infrastructure, applying security patches, and fixing bugs. Investing in one or more teams that can build and maintain such toolchains is also beyond the reach of many enterprises.

Given all this, we think there is a clear need for a tool tailored to data scientists for collaborative exploration of data, i.e., one that supports all major libraries used by data scientists and

  • does not require web development skills such as HTML and CSS,
  • does not require application development and deployment know-how,
  • makes reports easily accessible to a non-technical audience.

In an attempt to find an elegant solution for collaborative data exploration, a couple of friends and I have built dstack.ai, a simple tool that offers Python and R libraries to publish, track revisions of, and share datasets and visualizations.

In the following sections, we will show how dstack can help you to collaborate on your ad-hoc exploratory analysis and ML model prototypes with your teams and clients.

Python or R is all you need to use dstack

Let us start by setting up a profile. This requires you to create an account at dstack.ai. Once you have an account, you can log in and copy your token from the settings. The token and your dstack username are used to configure your profile via the command-line utility. You can use pip or conda to install the dstack package, which comes with the command-line tool.

pip install dstack
dstack config --token <TOKEN> --user <USER>

Once your profile is configured, you are ready to use the dstack libraries.

No need for CSS, HTML or complex deployment processes to publish your data analysis

Once you are ready with your code, you can use the dstack APIs to publish both static and interactive visualizations and datasets to the front-end of dstack.ai.

The APIs support the most popular visualization libraries in Python and R, such as Matplotlib, ggplot2, Plotly, and Bokeh. The APIs can also be used from anywhere: notebooks, scripts, or applications. This way, you do not need to embed your results in an HTML report to share them by email, or go through the hassle of learning server-side scripting to host them yourself.

To illustrate this, let us take an example where you want to publish a static chart. In this case, you import the push_frame method from the dstack Python library.

from dstack import push_frame

Once you plot your data (in this case, with Matplotlib as the plotting library), you can call push_frame to publish the visualization.

import matplotlib.pyplot as plt
from dstack import push_frame

fig = plt.figure()
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

push_frame("simple", fig, "My first plot")

You can learn more about using other visualization libraries at docs.dstack.ai.

The published chart can be accessed via the following URL: https://dstack.ai/<user>/simple

Notice how both the URL and the call to push_frame contain the value “simple”, which we call a stack.

In essence,

  • A stack is a unique stream of data, which can have many frames, but points to the latest frame that is pushed for publication.
  • A frame is a revision of your published data and can consist of a list of attachments.
  • An attachment can be a plot or a chart or a graph or even a pandas dataframe that can have its own set of parameters.
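The relationship between these three terms can be pictured with a small sketch in plain Python. This is not dstack's implementation: the class names, the head property and the url helper are hypothetical and exist only to illustrate the definitions above, and the user name "alice" is made up. The URL pattern follows the one shown earlier.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Attachment:
    """One published item (a plot or a dataframe) with its description and parameters."""
    data: Any
    description: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class Frame:
    """One revision of the published data: a list of attachments."""
    attachments: List[Attachment] = field(default_factory=list)

    def commit(self, data: Any, description: str, params: Optional[Dict[str, str]] = None) -> None:
        self.attachments.append(Attachment(data, description, dict(params or {})))


@dataclass
class Stack:
    """A named stream of frames that points to the most recently pushed one."""
    name: str
    frames: List[Frame] = field(default_factory=list)

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)

    @property
    def head(self) -> Optional[Frame]:
        # The latest pushed frame is the one a viewer sees.
        return self.frames[-1] if self.frames else None

    def url(self, user: str) -> str:
        # Same pattern as https://dstack.ai/<user>/simple
        return f"https://dstack.ai/{user}/{self.name}"


stack = Stack("simple")

first = Frame()
first.commit("plot-v1", "My first plot")
stack.push(first)

second = Frame()
second.commit("plot-v2", "My first plot, revised")
stack.push(second)

# Both revisions are kept, but the stack points to the latest one.
print(len(stack.frames), stack.head.attachments[0].description)
print(stack.url("alice"))
```

Pushing a second frame does not discard the first one; it only moves what the stack points to, which is how revisions stay trackable.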

dstack allows you to publish interactive visualizations and datasets

dstack's abstractions (stacks, frames, and attachments), together with the supported libraries, allow you to keep the published reports interactive: based on the parameters sent to the front-end of dstack.ai, a viewer can select and view the data corresponding to the selected parameters.

As an example, let’s publish an interactive visualization on a dataset using Plotly and use the dstack frame to pass the parameters.

import pandas as pd
import plotly.express as px
from dstack import create_frame

# Load the player data and drop rows with missing values
df = pd.read_csv("C:/Users/riwaj/Desktop/player_data.csv").dropna()
frame = create_frame("College_player_data")

# Count players per college and keep the ten largest
pdf = df['college'].value_counts().rename_axis('college').reset_index(name='players').head(10)
fig = px.bar(pdf, x='college', y='players')
frame.commit(fig, "Top 10 colleges by number of players", { "College": "Top 10 colleges" })

frame.push()

As you can see, in this example, we used three dstack methods to publish the visualization.

frame = create_frame("College_player_data")
frame.commit(fig, "Top 10 colleges by number of players", { "College": "Top 10 colleges" })
frame.push()

The snapshot below shows the resulting interactive data visualization in dstack UI.

[Image: the interactive visualization in the dstack UI. Plotly features such as zoom and scroll are also available in the visualization.]
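To make the aggregation step in the example easier to follow, here is a minimal sketch of what the value_counts chain produces before the result is handed to Plotly. The tiny in-memory table and its player and college names are invented for illustration; the real example reads player_data.csv instead.

```python
import pandas as pd

# Invented sample data standing in for player_data.csv
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "college": ["Duke", "Duke", "Duke", "UCLA", "UCLA", "Kansas"],
})

# value_counts() counts rows per college, sorted from most to fewest players.
# rename_axis() names the index, and reset_index() turns it into a regular
# column next to a "players" column holding the counts.
pdf = (
    df["college"]
    .value_counts()
    .rename_axis("college")
    .reset_index(name="players")
    .head(2)  # the article keeps the top 10; 2 is enough for this sample
)

print(pdf)
```

The result is a two-column dataframe (college, players), which is the shape px.bar expects for its x and y arguments.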

Now, what if you want to publish all the datasets? In this case, you can commit multiple datasets in the same frame with the parameters you want to use for interactivity.

for college in df["college"].unique():
    players = df.loc[df["college"] == college]
    frame.commit(players, f"Players from {college}", { "College": college })

This way, you can have interactive datasets and visualizations available in the same stack.

import pandas as pd
import plotly.express as px
from dstack import create_frame

# Load the player data and drop rows with missing values
df = pd.read_csv("C:/Users/riwaj/Desktop/player_data.csv").dropna()
frame = create_frame("College_player_data")

# Attachment 1: bar chart of the ten colleges with the most players
pdf = df['college'].value_counts().rename_axis('college').reset_index(name='players').head(10)
fig = px.bar(pdf, x='college', y='players')
frame.commit(fig, "Top 10 colleges by number of players", { "College": "Top 10 colleges" })

# One further attachment per college: the rows for that college
for college in df["college"].unique():
    players = df.loc[df["college"] == college]
    frame.commit(players, f"Players from {college}", { "College": college })

frame.push()

An example code snippet for interactive datasets and visualization

You can view the published stack at dstack.ai/riwaj/College_player_data

[Image: the published stack in the dstack UI. Users can also search for a specific dataset with the filter widget.]
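One way to think about the parameter-based selection in this example is as a lookup over the committed attachments: each attachment carries a parameter dictionary, and the viewer's choice picks the attachments whose parameters match. The sketch below is only a mental model under that assumption, not how the dstack front-end is implemented. The select function is hypothetical and the college names are invented.

```python
from typing import Dict, List, Tuple

# (description, params) pairs, mirroring the last two arguments of frame.commit
Attachment = Tuple[str, Dict[str, str]]


def select(attachments: List[Attachment], **chosen: str) -> List[str]:
    """Return the descriptions of attachments whose params match every chosen value."""
    return [
        description
        for description, params in attachments
        if all(params.get(key) == value for key, value in chosen.items())
    ]


attachments = [
    ("Top 10 colleges by number of players", {"College": "Top 10 colleges"}),
    ("Players from Duke", {"College": "Duke"}),
    ("Players from UCLA", {"College": "UCLA"}),
]

# Choosing a value for the "College" parameter narrows the view to one attachment.
print(select(attachments, College="Duke"))
```

Because the chart and the per-college datasets share the same "College" parameter, a single widget is enough to switch between them.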

dstack is built for collaboration

Besides the ability to publish interactive datasets and visualizations, several other features of dstack are built to support collaborative data exploration.

Published datasets and visualizations can be shared easily via a URL, so even a non-technical user can view published stacks in a web browser on the front-end application hosted at dstack.ai. Several users can interact with your published reports at the same time, and none of them can modify the code behind them.

dstack allows users to comment on stacks and take part in discussions, which greatly improves collaboration. Only logged-in users can comment on a stack, and the stack creator manages these permissions. All of these features are also accessible on mobile devices.

The access control for shares and comments works in the following way:

Stacks are public by default but can be made private via settings.


Once a stack is published, the stack creator can choose with whom to share it. Alternatively, the stack can be made public even if the default setting is private.


You do not need a dstack account to view a publicly shared stack, but you will need one to comment.

If you are logged in to dstack, you will see both the stacks you created and the ones others have shared with you.

[Image: overview of stacks in a dstack account, showing both self-created stacks and those shared by others.]
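The access rules described in this section can be summarized as a short sketch. It is a hypothetical model written for illustration, not dstack's code. In particular, it assumes that any logged-in user who can view a stack may also comment on it, whereas in dstack the stack creator manages comment permissions. The user names are made up.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class StackAccess:
    owner: str
    public: bool = True  # stacks are public by default
    shared_with: Set[str] = field(default_factory=set)


def can_view(stack: StackAccess, user: Optional[str]) -> bool:
    """Anyone can view a public stack; a private one needs ownership or a share."""
    if stack.public:
        return True
    return user is not None and (user == stack.owner or user in stack.shared_with)


def can_comment(stack: StackAccess, user: Optional[str]) -> bool:
    """Commenting requires being logged in (user is not None).

    Assumption: every logged-in viewer may comment.
    """
    return user is not None and can_view(stack, user)


report = StackAccess(owner="alice")
print(can_view(report, None), can_comment(report, None))   # anonymous visitor

report.public = False
report.shared_with.add("bob")
print(can_view(report, "bob"), can_view(report, "carol"))  # private, shared with bob only
```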

All the basic publishing services are and will always remain free

We are working on exciting features around collaborative data exploration for data scientists. Our vision is to simplify the operational and engineering aspects of data science by offering simple tools accessible to everyone. Please take a look at our product roadmap at the link below and vote for the features you would like to see in the future.

https://trello.com/b/CJOnEjrr/public-roadmap

Please sign up and come back to us with feedback and suggestions. We are very keen to hear your thoughts.

Thank you very much.

Sign up: https://dstack.ai/auth/signup

Learn more: https://dstack.ai

Documentation: docs.dstack.ai

Email for feedback: team@dstack.ai
