How to get the most important step in ad-hoc data analysis right


While a few decades ago data analysis was the prerogative of universities and private research groups, today even non-tech commercial companies are starting to use it at scale to build better products and improve processes. With the rise of technologies, data analysis has become both affordable and critical for companies to stay competitive.

There are tons of tools on the market for different kinds of analysis, ranging from tools tailored for specific needs to out-of-the-box solutions that cover a variety of use cases and industries.

Ad-hoc analysis is normally aimed at answering specific situational questions or finding new patterns in the data that might lead to important insights. Because the questions are not typical or the pattern in the data is unknown, the analysis requires human attention. Normally it involves a data analyst or a data scientist.

In this article, I’ll be talking about the ad-hoc data analysis that is done with programming languages such as Python or R.

Due to the availability of tools and libraries for data analysis and manipulation, Python and R are getting especially popular across the community of data scientists. Both Python and R offer interactive environments, such as notebooks, for data manipulation, as well as packages for data wrangling, visualization, and training machine learning models. Most of these tools are open source and do not require any special background to use (except maybe a statistical or mathematical one). All of this makes them very helpful for ad-hoc data analysis.

Typical workflow of ad-hoc data analysis

A typical workflow for ad-hoc data analysis with Python or R consists of the following simple steps:

1. Getting data

An ad-hoc analysis starts with getting the data that has to be analysed. The data can be provided by someone or acquired from external data sources.
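
For example, with Python and pandas, getting the data often comes down to a single call. Here’s a minimal sketch; the file name, URL, and column name are just placeholders:

import pandas as pd

# Load a local CSV file that someone handed over (placeholder file name)
df = pd.read_csv("orders.csv", parse_dates=["created_at"])

# The same call also works for data acquired from an external source, e.g. a public URL
# df = pd.read_csv("https://example.com/exports/orders.csv")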

2. Exploring data

Once the data is obtained, it usually needs exploration. This step is critical for understanding the nature of the data: its format, its completeness, and even its correctness. Methods of exploring data may include data wrangling, data aggregation, and data visualization.
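
As a minimal sketch, assuming df is the pandas dataframe loaded in the previous step, exploration might start like this:

# Inspect column types and non-null counts
df.info()

# Summary statistics for the numeric columns
print(df.describe())

# Check completeness: count missing values per column
print(df.isna().sum())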

3. Processing data

Often, in order to answer specific questions, the data must first be processed. Once the data is processed, e.g. cleaned, filtered, enriched, and aggregated, it’s a lot easier and faster to work with.
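
Here’s a minimal sketch of such processing with pandas, again assuming df is the dataframe from the previous steps and that the column names are placeholders:

# Clean: drop rows with missing values
clean = df.dropna()

# Filter: keep only the rows relevant to the question
filtered = clean[clean["country"] == "Canada"]

# Enrich: derive a new column from existing ones
enriched = filtered.assign(revenue=filtered["price"] * filtered["quantity"])

# Aggregate: e.g. total revenue per year
summary = enriched.groupby("year")["revenue"].sum().reset_index()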

In general, the steps of exploring and processing data can repeat until the analysis is over.

Most often, both steps are done using interactive tools, for example Jupyter notebooks in Python or RStudio in R.

Sometimes, the data processing step can also be run in batch on large datasets, e.g. using scripts.

4. Sharing results

Regardless of the type of data, the tools used to analyse it, and the purpose of the analysis, the end outcome of any data analysis is to get insights and share them with other people. Sometimes, the end result of data analysis can be a simple answer, such as a yes or no, or just a number. Often it can also be another dataset, a visualization, or even a machine learning model.

Why collaboration matters

When it comes to collaboration with other data scientists, sharing prepared datasets is very important, as it helps colleagues save time on acquiring and processing data and get to insights faster.

Because of how our brains work, visualization is an effective way to present the answer to a complex question to others. In addition to answering specific questions, visualizations can also tell stories, e.g. trigger new ideas about the researched topic.

Needless to say, sharing the end result is a critical part of the whole process of ad-hoc data analysis. Ad-hoc data analysis is expensive: it requires data, data scientists, and time. If the results of the analysis are not properly shared with the team, all the time and money spent on the analysis is wasted. Even worse than the wasted time and money is the missed chance to apply the results to improve the product or processes and stay competitive.

It is also easy to mistake sharing the end results for the end goal. In fact, it is only the first step: sharing ensures that there is an exchange of feedback between the data science team and the clients of the research.

How dstack.ai fits in here

While today there are a lot of tools that help with doing data analysis itself as well as with tracking data analysis tasks, there are very few tools that help organize the collaboration between data scientists and the end clients of the research. In tech companies, data scientists use tools such as notebooks with code and outputs, or custom web applications that they build and host themselves. Non-tech companies, on the other hand, don’t have specific tools and rather rely on email exchanges, file storage, or issue trackers for collaboration.

Because of the complexity or imperfection of these tools, it is often difficult to collaborate effectively or to find a needed result after some time.

This is where dstack.ai steps in, offering the missing tool to better organize the data analysis process and let teams collaborate more easily and perform meaningful work faster. Most importantly, instead of replacing an existing tool, dstack.ai can be used together with your other tools.

The dstack.ai tool consists of two parts:

  1. The dstack package for Python (PyPi) and R (CRAN)
  2. A web application (https://dstack.ai)

The dstack packages for Python and R offer functions for publishing any data analysis results — which can include both datasets and visualizations.

Installing the dstack package is possible via conda:

conda install dstack -c dstack.ai

Or using pip:

pip install dstack

Once the package is installed, you have to configure it by invoking the dstack command-line tool (installed together with the package):

dstack config --user <user> --token <token>

The user and token are obtained by signing up at https://dstack.ai. The token is used for authorization and ensures secure access to your published data.

In order to publish a dataset or a visualization, you only need to choose a name (a stack name) and pass a pandas dataframe or a matplotlib, Bokeh, or Plotly figure.

Here’s an example of publishing a pandas dataframe:

import pandas as pd
import numpy as np
from dstack import push_frame

# Build a small example dataframe with random values
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Publish the dataframe as a stack named "pandas_example"
push_frame("pandas_example", df, "My first dataset")

The published dataframe will be available at https://dstack.ai/&lt;user&gt;/pandas_example.


With the upcoming update of the dstack package, you’ll be able to download the published dataset back as a pandas dataframe using the following code:

df = pd.read_csv(pull_attach("pandas_example"))

Publishing a visualization, be it a ggplot2, matplotlib, Bokeh, or Plotly figure, works the same way. Here’s an example for Plotly:

import plotly.express as px
from dstack import push_frame

# Build a simple Plotly line chart
df = px.data.gapminder().query("country=='Canada'")
fig = px.line(df, x="year", y="lifeExp")

# Publish the figure as a stack named "simple_plotly"
push_frame("simple_plotly", fig, "Life expectancy in Canada")

The published chart will be available at https://dstack.ai/&lt;user&gt;/simple_plotly.


Note that the dstack package allows you to publish multiple datasets and charts at once, each with its own parameter values. Here’s an example:

import plotly.express as px
from dstack import create_frame

# Create a frame that will hold several charts under one stack name
frame = create_frame("life_expectancy")
countries = ["United States", "Canada", "France", "Japan"]

# Commit one chart per country, each with its own parameter value
for country in countries:
    df = px.data.gapminder().query(f"country=='{country}'")
    fig = px.line(df, x="year", y="lifeExp")
    frame.commit(fig, f"Life expectancy in {country}", 
        {"Country": country})

# Push all committed charts at once
frame.push()

All the published charts, one per country, are available at https://dstack.ai/&lt;user&gt;/life_expectancy in the form of an interactive dashboard where the user can select a country and see the corresponding chart.


Here’s a public example of a published chart made to monitor new cases of COVID-19 by country: https://dstack.ai/cheptsov/covid19/speed_by_country_all. The code that publishes this dashboard can be found at https://github.com/dstackai/dstack-tutorials-py/blob/master/covid-19-speed.ipynb.

One great thing about the Python and R dstack packages is that they can be used from anywhere, be it a notebook, a script, a job, or even an application.
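
For instance, here’s a hypothetical sketch of a standalone script that could run as a scheduled job and publish a fresh revision of a chart on every run; the data source and stack name are made up for illustration:

import pandas as pd
import plotly.express as px
from dstack import push_frame

# Reload the latest data (placeholder data source)
df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])

# Build the chart and push it as a new revision of the same stack
fig = px.line(df, x="date", y="value")
push_frame("daily_metrics", fig, "Daily metrics, refreshed automatically")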

The full documentation on how to use the dstack packages for Python and R is available at docs.dstack.ai.

Once you’ve signed up for https://dstack.ai, you can browse all published charts and manage permissions, e.g. make stacks public or share them with selected users. If you only access stacks published by others, you’ll see the stacks that were shared specifically with you.


If a stack is public or shared with you, you can comment on it to exchange feedback.

If you update a stack, https://dstack.ai keeps track of every revision and lets you browse all revisions and roll back to any of them.


Last but not least, all stacks are available from a mobile phone through the mobile-friendly version of the website. This is convenient if you want to review updated results on the go.

Final thoughts

As you can see from the examples above, dstack.ai can be used from your notebooks, scripts, jobs, or even applications to publish every revision of your intermediate or final data analysis results and share them with your team in a secure way to exchange feedback.

While dstack.ai is quite young, its team is actively working on a lot of new features. You can check the public roadmap, provide your feedback, and upvote the features you’re most interested in.

Are you interested? Try it now! The service is completely free for anyone to use.


That’s it for today! In the future, we will write more articles going into detail on how the tool compares to others. I hope you enjoyed the article, and I’m looking forward to your feedback. Please share your experience with ad-hoc data analysis and the tools you find most useful!
