HuggingFace Datasets

MaheswaraReddy
2 min read · Feb 1, 2023


One of the most common NLP tasks is text classification. It has many applications, such as tagging customer feedback into categories or routing support tickets by language.

Sentiment analysis is another common NLP application: it determines the polarity of a given text.

So, NLP has applications in a wide range of fields. In this article, I give a glimpse of HuggingFace and how it can be used in NLP applications.

The main building blocks in HuggingFace are Datasets, Tokenizers, and Transformers.

Let's look at Datasets in HuggingFace.

When I first tried to import datasets, I got an error saying No module named 'datasets'. Then I remembered that I didn't have the datasets library installed on my system, so I installed it using pip.

pip install datasets

After installing, I executed the code below:

from datasets import list_datasets

all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets available on HuggingFace Hub")
print(f"First 5 are : {all_datasets[:5]}")

Output:

There are 20138 datasets available on HuggingFace Hub

First 5 are : ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus']

So, HuggingFace has 20138 datasets (at the time of writing), and the first 5 of them are shown above.

A specific dataset can be loaded with the load_dataset function.

from datasets import load_dataset

aeslc = load_dataset("aeslc")
print(aeslc)

Output:

DatasetDict({
    train: Dataset({
        features: ['email_body', 'subject_line'],
        num_rows: 14436
    })
    validation: Dataset({
        features: ['email_body', 'subject_line'],
        num_rows: 1960
    })
    test: Dataset({
        features: ['email_body', 'subject_line'],
        num_rows: 1906
    })
})

The output is a DatasetDict, a dictionary in which each key corresponds to a different split.

The train split can be accessed through the "train" key, and its length and contents are then easy to inspect.

train_ds = aeslc["train"]
print(train_ds)
print(len(train_ds))
print(train_ds[0])

Output:

14436

{'email_body': 'Greg/Phillip, Attached is the Grande Communications Service Agreement.\nThe business points can be found in Exhibit C. I Can get the Non-Disturbance agreement after it has been executed by you and Grande.\nI will fill in the Legal description of the property one I have received it.\nPlease execute and send to: Grande Communications, 401 Carlson Circle, San Marcos Texas, 78666 Attention Hunter Williams.\n<<Bishopscontract.doc>>\n', 'subject_line': 'Service Agreement'}

If a dataset is not available on the HuggingFace Hub but exists on your laptop or local system, it can still be loaded with load_dataset by pointing it at local files.

Once the dataset is loaded, it is often convenient to work with it as a pandas DataFrame. The set_format function changes the output format of the dataset, and the format can be removed again with reset_format.

import pandas as pd

# After this, indexing the dataset returns pandas objects
aeslc.set_format(type="pandas")

# Restore the default output format
aeslc.reset_format()

The remaining building blocks, Tokenizers and Transformers, will be covered in separate articles to keep this one short. Thank you so much for reading.
