Setting Up Your First Machine Learning Project with DAGsHub
Introduction to DAGsHub
This guide will demonstrate how to initiate a Machine Learning project using DAGsHub, a platform tailored for enhancing collaboration on Data Science projects that involve data versioning and model development.
In this article, we will explore:
- The challenges and limitations of traditional Git and DVC setups for Machine Learning projects.
- An introduction to DAGsHub and how it addresses these issues.
- A step-by-step guide to setting up your first project using DAGsHub and its storage capabilities.
Challenges with Conventional Git and DVC Configurations
Collaboration in Machine Learning projects often relies on Git, a practice borrowed from software engineering. However, Machine Learning projects differ significantly because they frequently involve large data files that require regular updates, which Git struggles to manage effectively.
Though many Machine Learning practitioners have managed to adapt Git for their needs, it was never the ideal solution. As a workaround, some have turned to DVC (Data Version Control), which still utilizes Git but stores data files in remote cloud systems like S3, GDrive, or Azure.
While this approach simplifies collaboration on large data projects, it still necessitates a complex setup to integrate external storage with DVC.
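For a sense of that overhead, pointing DVC at an S3 bucket, for example, typically looks something like the commands below, on top of first creating the bucket and access keys in AWS (the bucket name and keys here are placeholders):
pip install "dvc[s3]"
dvc remote add -d storage s3://<my-bucket>/dvc-store
dvc remote modify storage --local access_key_id <AWS_ACCESS_KEY_ID>
dvc remote modify storage --local secret_access_key <AWS_SECRET_ACCESS_KEY>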
What is DAGsHub and DAGsHub Storage?
DAGsHub is a free tool designed for Machine Learning projects that eliminates the aforementioned problems. Built on Git and DVC, it facilitates easy data and model versioning and allows for tracking of experiments. The recent addition of DAGsHub Storage further streamlines the setup of Machine Learning projects, removing the need for extensive configuration.
This means you no longer have to invest in cloud storage options such as AWS, GCS, Azure, or GDrive for managing large datasets, nor will you face the difficulties of setting them up with DVC. With DAGsHub Storage, you can conveniently access your code, data, models, and experiments all in one location.
Exciting, right? In the following section, I will guide you through the process of establishing a machine learning project using DAGsHub and DAGsHub Storage in a few simple steps.
Setting Up Your First Project with DAGsHub
To begin, the first step is to create a DAGsHub account. You can do so by following this link.
Once registered, you can log in to your DAGsHub space, where you will create a new repository.
Fill in the repository details, including a name and a brief description, and leave the other settings at their defaults. After creating the repository, initialize the project from the command line and point Git at your new DAGsHub repository (copy the remote URL from the repository page):
mkdir my_first_dags_hub_project
cd my_first_dags_hub_project
git init
git remote add origin https://dagshub.com/<DAGsHub_username>/<repo_name>.git
Your project structure isn't established yet, so let's create directories to store data and outputs:
mkdir data
mkdir outputs
We will use the example data from the machine learning project in the official DAGsHub tutorial. Download the tutorial's requirements.txt file and place it in your project's root directory. Then create a virtual environment with the following commands:
python3 -m venv .venv
echo .venv/ >> .gitignore
echo __pycache__/ >> .gitignore
source .venv/bin/activate
Next, install the requirements in your new virtual environment:
pip install -r requirements.txt
You're now ready to add training data to the previously created data folder: download the CrossValidated-Questions.csv file used in the official DAGsHub tutorial and save it under data/. Since the data and output files will be versioned with DVC rather than Git, exclude both folders from Git tracking:
echo /data/ >> .gitignore
echo /outputs/ >> .gitignore
The commands above add the data/ and outputs/ folders to .gitignore so that Git ignores their contents; we will version them with DVC shortly.
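For the download itself, a command along these lines fetches the dataset into the data folder (replace the placeholder with the download URL given on the tutorial page):
wget -O data/CrossValidated-Questions.csv <URL-from-the-DAGsHub-tutorial>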
Finally, commit your changes to Git:
git add .
git commit -m "Initialized my first DAGs Hub project"
git push -u origin master
At this point, your repository should appear like this:
You will see only the .gitignore file and requirements.txt. Let's manage the data folders with DVC. First, add the training script, main.py (taken from the official DAGsHub tutorial), to your project directory and push it to the remote repository:
git add main.py
git commit -m "Adding training file"
git push -u origin master
Your project should now look like this:
The main.py script accepts a sub-command: split divides the raw data into training and test sets, while train fits the machine learning model (a short illustrative sketch of such a script appears after the training step below). Let's start by splitting the data (make sure your virtual environment is still activated):
python main.py split
Then, proceed to train a model:
python main.py train
You should see output similar to this:
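For orientation, here is a rough sketch of how a script like main.py might route those two sub-commands. It is only an illustration under assumed file paths and column names (Title, MachineLearning); the actual script from the DAGsHub tutorial differs:
import argparse

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed paths and column names -- the tutorial's real script differs.
RAW_DATA = "data/CrossValidated-Questions.csv"
TRAIN_PATH, TEST_PATH = "data/train.csv", "data/test.csv"
MODEL_PATH = "outputs/model.joblib"

def split():
    # Split the raw CSV into train/test files inside data/
    df = pd.read_csv(RAW_DATA)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
    train_df.to_csv(TRAIN_PATH, index=False)
    test_df.to_csv(TEST_PATH, index=False)

def train():
    # Train a simple text classifier and save it to outputs/
    train_df = pd.read_csv(TRAIN_PATH)
    model = make_pipeline(TfidfVectorizer(), SGDClassifier())
    model.fit(train_df["Title"].fillna(""), train_df["MachineLearning"])
    joblib.dump(model, MODEL_PATH)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split the data or train a model")
    parser.add_argument("command", choices=["split", "train"])
    args = parser.parse_args()
    split() if args.command == "split" else train()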
After splitting, new zip files will appear in the data folder, and the training process will generate .joblib files in the outputs folder. All of these need to be tracked by DVC, so initialize it by running:
dvc init
Now add the data and outputs for DVC tracking:
dvc add data
dvc add outputs
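Note that dvc add does not put the data itself into Git; it writes small pointer files (data.dvc and outputs.dvc) that record a content hash and path, roughly like this (the hash and sizes below are placeholders):
outs:
- md5: 0123456789abcdef0123456789abcdef.dir
  size: 1048576
  nfiles: 3
  path: data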
Then, include the corresponding information in Git:
git add .dvc data.dvc outputs.dvc
git commit -m "Added data and outputs to DVC"
Next, set up DAGsHub Storage as the DVC remote. The remote URL is shown on your repository page on DAGsHub (it is your repository URL with a .dvc suffix). Run the following commands, substituting your own credentials:
dvc remote add origin https://dagshub.com/<DAGsHub_username>/<repo_name>.dvc
dvc remote default origin --local
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <DAGsHub_username>
dvc remote modify origin --local password <DAGsHub_password>
Finally, commit the setup changes to Git:
git add .dvc/config
git commit -m "Configured the DVC remote"
git push -u origin master
dvc push --all-commits
This process will result in all files being uploaded to DAGsHub Storage.
As illustrated, all components (code, data files, and models) are versioned, stored, and accessible in one place. You've successfully set up your first Data Science project using DAGsHub and its storage system!
You might be curious: where are the setup steps for DAGsHub Storage? The reality is that there aren't any beyond the few commands above. Your data and models are stored by DAGsHub Storage without any intricate configuration, which makes it a far simpler option than the services offered by Amazon or Google.
Additionally, the data stored on DAGsHub is easy to view and search. To try this out, navigate to the data folder in your repository on DAGsHub and open the CrossValidated-Questions.csv file. The file we used to train the model is rendered right in the browser, much like a searchable pandas DataFrame.
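If you'd rather inspect the same file locally (for example, after pulling the data with DVC), a quick look with pandas is enough; the path below assumes the tutorial's file name:
import pandas as pd

# Assumes the DVC-tracked data has been pulled into the local data/ folder
df = pd.read_csv("data/CrossValidated-Questions.csv")
print(df.head())              # first few rows
print(df.columns.tolist())    # available columns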
DAGsHub Storage comes equipped with many productive features that can significantly enhance your machine learning project. It's also an excellent collaboration tool, enabling your entire team to work together seamlessly.
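For instance, a teammate can reproduce your entire workspace with the standard Git and DVC commands below; the repository URL and credentials are placeholders for their own values:
git clone https://dagshub.com/<DAGsHub_username>/<repo_name>.git
cd <repo_name>
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <their_DAGsHub_username>
dvc remote modify origin --local password <their_DAGsHub_password>
dvc pull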
Summary
In this article, you learned how to set up your first project with DAGsHub and DAGsHub Storage, which removes the need to configure an external remote for your data and models. This setup offers smoother collaboration than the traditional combination of Git, DVC, and a separate cloud storage provider.
I hope you found this information helpful, and happy coding!
Recommended Reading
Top 9 Jupyter Notebook Extensions
Enhance notebook functionality and boost your productivity.
Introduction to Python f-strings
Learn why you should start using them today.
Top 8 Magic Commands in Jupyter Notebook
Discover the most useful commands to increase your productivity.