Advanced Pandas Techniques for Data Processing and Performance
1. Data Chunking
We often encounter scenarios where the data is larger than the available memory (RAM). In such cases, it's a good idea to read the data from the file in chunks so that the system doesn't run out of memory.
import pandas as pd

# Initialize an empty list to store the processed chunks
private_rooms = []

# Read the CSV file in chunks of 1,000 rows at a time
for chunk in pd.read_csv('data/listings.csv', chunksize=1000):
    # Process each chunk (for example, filter listings where room_type is "Private room")
    processed_chunk = chunk[chunk["room_type"] == "Private room"]

    # Append the processed chunk to the list
    private_rooms.append(processed_chunk)

# Combine all processed chunks into a single DataFrame
private_rooms = pd.concat(private_rooms)
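The same pattern works when you only need an aggregate rather than the filtered rows themselves: reduce each chunk as it is read, so no more than one chunk is ever held in memory. A minimal sketch, assuming the same listings.csv file:

import pandas as pd

# Running count of listings per room type, accumulated one chunk at a time
room_type_counts = None

for chunk in pd.read_csv('data/listings.csv', chunksize=1000):
    counts = chunk["room_type"].value_counts()
    # fill_value=0 handles room types that do not appear in every chunk
    room_type_counts = counts if room_type_counts is None else room_type_counts.add(counts, fill_value=0)

print(room_type_counts)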
2. Progress Bar
Pandas' apply() function is commonly used to perform element-wise operations on DataFrame columns or rows. With tqdm's pandas integration, you can attach a progress bar to these operations via progress_apply():
import time
from datetime import datetime

from tqdm import tqdm

tqdm.pandas()  # add this to integrate progress bar functionality into pandas


def convert_to_datetime(date):
    # Return the value unchanged if it is not a string (e.g. a missing value)
    if not isinstance(date, str):
        return date

    # time.sleep() added to demonstrate the progress bar more effectively
    time.sleep(0.1)

    # Parse the string into a datetime object
    return datetime.strptime(date, "%Y-%m-%d")


# progress_apply() behaves like apply() but displays a tqdm progress bar
private_rooms["last_review"] = private_rooms["last_review"].progress_apply(convert_to_datetime)
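tqdm.pandas() patches more than just apply(): Series.map(), for example, gets a progress_map() counterpart. A toy sketch (the room_type_upper column is purely illustrative):

# progress_map() works like Series.map() but shows a progress bar;
# the room_type_upper column here is just for demonstration
private_rooms["room_type_upper"] = private_rooms["room_type"].progress_map(str.upper)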
3. Populate Multiple Columns
Sometimes a single function needs to populate more than one column at once. The solution is apply() with result_type="expand": when the applied function returns a tuple, the values are expanded into separate columns.
def get_date_and_month_name(last_review_date):
    # Return the day name and month name as a tuple
    return last_review_date.day_name(), last_review_date.month_name()


# result_type="expand" spreads the returned tuple across the two new columns
private_rooms[["Day", "Month"]] = private_rooms.apply(
    lambda x: get_date_and_month_name(x["last_review"]),
    axis=1,
    result_type="expand"
)
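Both columns are populated in a single pass. A quick check of the result:

# Inspect the newly added columns alongside the source column
print(private_rooms[["last_review", "Day", "Month"]].head())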
4. Parallel Processing
Use the multiprocessing module. Split the DataFrame into multiple chunks (using np.array_split()), depending on the number of cores available on your system, and process each chunk in a separate process. Once the chunks are processed and the desired output for each chunk is obtained, they can be concatenated back into a single DataFrame.
import time
import random
from multiprocessing import Pool

import numpy as np
import pandas as pd
from tqdm import tqdm

tqdm.pandas()


def predict_sentiment(review):
    time.sleep(0.1)  # simulating the time taken by a prediction model
    return "Positive" if random.randint(1, 10) > 5 else "Negative"


def batch_predict_sentiment(review_df):
    review_df["sentiment"] = review_df["comments"].progress_apply(predict_sentiment)
    return review_df


def fetch_sentiment_for_review():
    n_cores = 64
    reviews = pd.read_csv("data/reviews.csv")
    review_batches = np.array_split(reviews, n_cores)  # split into the same number of batches as n_cores

    # Process the batches in parallel
    with Pool(n_cores) as pool:
        sentiment_prediction_batches = pool.map(batch_predict_sentiment, review_batches)

    # Once all batches are processed, concatenate the list of DataFrames into one
    sentiment_prediction = pd.concat(sentiment_prediction_batches)
    return sentiment_prediction


reviews_with_sentiment = fetch_sentiment_for_review()
From more than 13 hours to less than 13 minutes! Each batch consists of 7521 reviews, and there are a total of 64 batches.
Tip: in this scenario, n_cores is set higher than the actual number of cores the system has. This is because the CPU remains idle during the execution of time.sleep(0.1), so the processes interleave while waiting. If your workload is CPU-intensive, it is recommended to keep n_cores at or below the actual number of cores your system has.
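For CPU-bound work, a reasonable starting point is to query the machine rather than hard-coding a value; a minimal sketch:

from multiprocessing import cpu_count

# For CPU-bound workloads, stay at or below the number of available cores;
# for I/O-bound workloads (like the sleep above), oversubscribing can pay off
n_cores = cpu_count()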
5. Complex Merging
Merging is quite a common operation performed by anyone who deals with data. However, it can sometimes be hard to tell whether any particular data points were lost during the merge. That can happen for a plethora of reasons, the worst being malformed or faulty data.
The fix is the indicator=True parameter of merge(). When enabled, it creates a new column named _merge that takes one of three values depending on where each row's key was found:
- left_only: the row's key exists only in the left DataFrame; no match was found in the right DataFrame
- right_only: the row's key exists only in the right DataFrame; no match was found in the left DataFrame
- both: the row's key exists in both DataFrames, so the rows were matched
# Outer merge with indicator=True adds a _merge column describing each row's origin
merged_df = listings.merge(
    reviews,
    left_on="id",
    right_on="listing_id",
    how="outer",
    indicator=True
)
merged_df.shape
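With the _merge column in place, counting and inspecting unmatched rows takes one line each; for example:

# How many rows fell into each category?
print(merged_df["_merge"].value_counts())

# Listings that found no matching review in the right DataFrame
unmatched_listings = merged_df[merged_df["_merge"] == "left_only"]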
6. Data Segmentation
pd.cut() is a powerful function for segmenting data into multiple bins. It can also act as a way to convert continuous values into categorical ones. One such scenario is demonstrated in the example below: we will segment the price of each listing into a predetermined set of price brackets (bins): "$0 - $100", "$101 - $250", "$251 - $500", "$500 - $1000", and "$1000+".
Create the bins and a label for each bin:
bins = [0, 100, 250, 500, 1000, float('inf')]
labels = ["$0 - $100", "$101 - $250", "$251 - $500", "$500 - $1000", "$1000+"]
listings["price_bucket"] = pd.cut(listings["price"], bins=bins, labels=labels)
Please note that the number of labels (5) is one less than the number of bin edges (6). This is because each pair of consecutive edges defines a single interval, so the first two values in bins form the first bucket.
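A quick way to sanity-check the segmentation is to count the listings per bucket:

# Distribution of listings across the price brackets, in bin order
print(listings["price_bucket"].value_counts().sort_index())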
7. Cross-Tabulation Analysis
Using the above data, we can go one step further and perform a cross-tabulation between the price brackets and the room types available within them. Here's where pd.crosstab() comes into play: it can be used to perform a simple cross-tabulation between price_bucket and room_type.
room_type_breakup = pd.crosstab(
listings["price_bucket"],
listings["room_type"],
margins=True # This will add the row and column totals
)
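If proportions are more useful than raw counts, pd.crosstab() also accepts a normalize parameter; for example, normalize="index" expresses each room type's share within its price bracket:

# Share of each room type within every price bracket (each row sums to 1.0)
room_type_share = pd.crosstab(
    listings["price_bucket"],
    listings["room_type"],
    normalize="index"
)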