Generate Data

Modules

faker

In this post we will introduce several must-know Pandas methods for effective data exploration.

Pandas Functions

Faker

TODO: Use stock data to demonstrate this.

Faker is not part of pandas, but we need a dataset to demonstrate the various pandas functions.

Create a CSV sample dataset using faker.

import random
import csv
from faker import Faker

# Initialize Faker

fake = Faker()

# List of products and their categories

products = [
    {"name": "Laptop", "category": "Electronics", "price": 899.99},
    {"name": "Smartphone", "category": "Electronics", "price": 699.99},
    {"name": "Headphones", "category": "Accessories", "price": 49.99},
    {"name": "Coffee Maker", "category": "Home Appliances", "price": 79.99},
    {"name": "Sneakers", "category": "Fashion", "price": 59.99},
    {"name": "Backpack", "category": "Fashion", "price": 39.99},
    {"name": "Blender", "category": "Home Appliances", "price": 99.99},
    {"name": "Desk Chair", "category": "Furniture", "price": 129.99},
    {"name": "Water Bottle", "category": "Accessories", "price": 19.99},
    {"name": "Notebook", "category": "Stationery", "price": 5.99},
]

# Define a function to generate order data

def generate_order_data(num_rows):
    data = []
    for _ in range(num_rows):
        product = random.choice(products)
        quantity = random.randint(1, 10)
        total = round(product["price"] * quantity, 2)
        order = {
            "Order ID": fake.uuid4(),
            "Customer Name": fake.name(),
            "Customer Email": fake.email(),
            "Product Name": product["name"],
            "Category": product["category"],
            "Quantity": quantity,
            "Price": product["price"],
            "Total": total,
            "Order Date": fake.date_this_year(),
            "Shipping Address": fake.address(),
        }
        data.append(order)
    return data

# Generate 1000 rows of data

num_rows = 1000
order_data = generate_order_data(num_rows)

# Save the data to a CSV file

output_file = "sample_orders.csv"
with open(output_file, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=order_data[0].keys())
    writer.writeheader()
    writer.writerows(order_data)
print(f"Sample dataset with {num_rows} rows has been saved to '{output_file}'.")

This code saves the sample dataset as a structured CSV file (sample_orders.csv) for further analysis with pandas functions.

Head

head(): Use to preview the top rows of the sample dataset.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the first 10 rows of the dataset

print(df.head(10))

Tail

tail(): Use to preview the bottom rows of the sample dataset.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the last 10 rows of the dataset

print(df.tail(10))
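
`head()` and `tail()` both return five rows when called without an argument, and both accept a negative `n`. The sketch below illustrates this on a small made-up DataFrame; the `order_no` column is a hypothetical stand-in and is not part of sample_orders.csv.

```python
import pandas as pd

# Hypothetical 8-row frame standing in for sample_orders.csv
df = pd.DataFrame({"order_no": range(1, 9)})

first_default = df.head()         # no argument: the first 5 rows
last_three = df.tail(3)           # the last 3 rows
all_but_last_two = df.head(-2)    # negative n: every row except the last 2
all_but_first_two = df.tail(-2)   # negative n: every row except the first 2

print(len(first_default), len(last_three))
print(all_but_last_two["order_no"].tolist())
print(all_but_first_two["order_no"].tolist())
```

The negative form is handy when we want to drop a few trailing or leading rows without computing the length of the DataFrame first.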

Sample

sample(): This function is highly valuable when working with large datasets. When we need to extract and analyze a smaller subset from a larger DataFrame, sample() helps efficiently retrieve random samples, enabling preliminary data exploration or performance evaluation.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display 10 random rows from the dataset

print(df.sample(10))
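
By default `sample()` draws different rows on every run. If we need a repeatable subset, the `random_state` parameter fixes the draw, and `frac` requests a fraction of the rows instead of a fixed count. A minimal sketch, assuming a made-up 100-row DataFrame with a hypothetical `order_no` column:

```python
import pandas as pd

# Hypothetical 100-row frame standing in for a larger dataset
df = pd.DataFrame({"order_no": range(1, 101)})

# The same random_state yields the same rows on every call
first_draw = df.sample(n=10, random_state=42)
second_draw = df.sample(n=10, random_state=42)

# frac draws a share of the rows (here 10%) instead of a fixed number
tenth = df.sample(frac=0.1, random_state=0)

print(first_draw.equals(second_draw))
print(len(tenth))
```

Sampling is done without replacement unless `replace=True` is passed, so a row cannot appear twice in the result.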

Info

info(): This function prints a summary of the dataset, including the number of entries, column names, non-null counts, data types, and memory usage.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display a summary of the dataset

df.info()  # info() prints the summary itself and returns None
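
`info()` writes its report straight to standard output and returns `None`, so there is nothing useful to assign or print from its return value. When we want the summary as a string, for example to write it to a log, the `buf` parameter accepts any writable text buffer. A small sketch on a made-up two-column DataFrame:

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in "Category"
df = pd.DataFrame({
    "Category": ["Fashion", "Electronics", None],
    "Quantity": [2, 5, 1],
})

result = df.info()    # prints the summary; the return value is None

buffer = io.StringIO()
df.info(buf=buffer)   # writes the same summary into the buffer instead
summary = buffer.getvalue()

print(result)
print(summary)
```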

Describe

describe(): This function provides basic statistical information about the numeric columns of the dataset, such as count, mean, standard deviation, minimum and maximum values, and quartiles.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the basic statistical information about the dataset

print(df.describe())
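
Called without arguments, `describe()` summarizes only the numeric columns. Passing `include="all"` adds the text columns as well, reporting their number of distinct values, most frequent value, and its frequency. A minimal sketch on a made-up DataFrame that reuses two column names from the sample dataset:

```python
import pandas as pd

# Hypothetical four-row frame with one text and one numeric column
df = pd.DataFrame({
    "Category": ["Fashion", "Fashion", "Electronics", "Stationery"],
    "Quantity": [2, 4, 6, 8],
})

numeric_only = df.describe()             # numeric columns only
everything = df.describe(include="all")  # text columns are summarized too

print(numeric_only)
print(everything)
```

In the combined output, statistics that do not apply to a column (such as the mean of a text column) appear as NaN.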

Value Counts

value_counts(): This method returns how often each unique value occurs in a column or a pandas Series, sorted from most to least frequent.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the count of each unique value in a column, such as "Category"

print(df["Category"].value_counts())
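
When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a short made-up Series of category labels rather than the generated CSV:

```python
import pandas as pd

# Hypothetical category labels
categories = pd.Series(
    ["Fashion", "Fashion", "Electronics", "Fashion", "Electronics", "Stationery"],
    name="Category",
)

counts = categories.value_counts()                 # absolute counts
shares = categories.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

Missing values are left out by default; passing `dropna=False` counts them as their own entry.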

Shape

shape: This attribute returns the number of rows and columns in the dataset.

~~~python
import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the number of rows and columns in the dataset

print(df.shape)
~~~
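
Because `shape` is a plain tuple of (rows, columns), it can be unpacked directly into two variables. A minimal sketch on a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
df = pd.DataFrame({
    "Quantity": [1, 2, 3],
    "Price": [5.99, 19.99, 39.99],
})

n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses

print(f"{n_rows} rows, {n_cols} columns")
print(len(df) == n_rows)    # len(df) gives the row count as well
```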

#### Dtypes

dtypes: This attribute returns the data type of each column.

~~~python
import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the data types of all columns

print(df.dtypes)
~~~
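
Checking the data types right after loading is useful because a CSV file stores everything as text and pandas has to infer each column's type. Dates are not parsed by default, so a column such as "Order Date" comes back as text unless we ask for conversion with `parse_dates`. A minimal sketch, assuming two made-up rows that mimic three of the columns in sample_orders.csv:

```python
import io

import pandas as pd

# Two hypothetical rows in the same layout as part of sample_orders.csv
csv_text = (
    "Quantity,Price,Order Date\n"
    "2,59.99,2024-01-15\n"
    "1,5.99,2024-02-03\n"
)

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order Date"])

print(plain.dtypes)    # "Order Date" is read as text
print(parsed.dtypes)   # "Order Date" is now a datetime column
```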

Unique

unique(): This method returns all unique values in a column or a pandas Series.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display all unique values in a column

print(df["Category"].unique())
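
The values returned by `unique()` appear in the order in which they are first encountered, not in sorted order. If we want them alphabetically, we can sort the result ourselves. A small sketch on a made-up Series of category labels:

```python
import pandas as pd

# Hypothetical category labels
categories = pd.Series(
    ["Fashion", "Electronics", "Fashion", "Accessories", "Electronics"]
)

values = categories.unique()   # order of first appearance
ordered = sorted(values)       # alphabetical order

print(list(values))
print(ordered)
```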

Nunique

nunique(): This method returns the number of unique values in each column of a DataFrame.

import pandas as pd

# Read the sample_orders.csv file into a Pandas DataFrame

df = pd.read_csv("sample_orders.csv")

# Display the count of unique values in each column, sorted in descending order

print(df.nunique().sort_values(ascending=False))
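
Note that `nunique()` ignores missing values unless told otherwise. The sketch below, on a made-up DataFrame with one missing category, shows the difference that `dropna=False` makes:

```python
import pandas as pd

# Hypothetical frame with one missing value in "Category"
df = pd.DataFrame({
    "Category": ["Fashion", "Electronics", "Fashion", None],
    "Quantity": [1, 2, 3, 3],
})

per_column = df.nunique()                 # missing values are not counted
with_missing = df.nunique(dropna=False)   # a missing value counts as one more

print(per_column)
print(with_missing)
```

The same method is available on a single column, for example `df["Category"].nunique()`.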