How Can I Merge Multiple Excel Files Automatically in Python?


You're at the end of the quarter. Your manager asks for a summary sales report. The problem? The data isn't in one neat file. It’s spread across a dozen different Excel spreadsheets: Sales_Jan.xlsx, Sales_Feb.xlsx, Sales_Mar.xlsx, and so on.

Your heart sinks. You know what this means: hours of tedious, soul-crushing manual labor. Opening each file, carefully copying the data, pasting it into a master sheet, and praying you don’t miss a row or paste over existing data.

This is a classic automation problem, and one that a short Python script can handle for you, saving hours of work and removing most of the opportunities for copy-and-paste mistakes.


The Goal: From Many Files to One Master File

Our objective is simple. We have a folder full of Excel files with a similar structure, and we want to combine them into a single, consolidated Excel file for analysis or reporting.

For this to work smoothly, there are two prerequisites:

  1. All the Excel files should have the same column structure (e.g., "Date", "Product", "Amount").

  2. All the files should be located in the same folder.
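
The first prerequisite can be checked before merging. The following is a minimal sketch, not part of the original script: the helper name find_mismatched_columns and the sample data are made up for illustration, and it compares column names only (order is ignored).

```python
import pandas as pd


def find_mismatched_columns(frames):
    """Return {label: sorted column names} for frames that differ from the first.

    `frames` maps a label (for example a file name) to a DataFrame.
    Only the set of column names is compared; column order is ignored.
    """
    items = list(frames.items())
    if not items:
        return {}
    _, first_df = items[0]
    expected = set(first_df.columns)
    return {
        name: sorted(df.columns)
        for name, df in items[1:]
        if set(df.columns) != expected
    }


# Tiny in-memory stand-ins for two monthly files (illustrative data only)
jan = pd.DataFrame({"Date": ["2024-01-05"], "Product": ["Widget"], "Amount": [100]})
feb = pd.DataFrame({"Date": ["2024-02-05"], "Product": ["Widget"], "Sales": [120]})

problems = find_mismatched_columns({"Jan.xlsx": jan, "Feb.xlsx": feb})
print(problems)  # {'Feb.xlsx': ['Date', 'Product', 'Sales']}
```

With real files, one option is to load only the header row of each workbook with pd.read_excel(file, nrows=0) and pass those DataFrames to the helper.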

The Tools for the Job: pandas and glob

We'll use two simple but powerful Python tools to accomplish this.

  1. pandas: As we've seen, this is our data manipulation powerhouse. We'll use it to read each Excel file and then to combine them. To read .xlsx files, pandas relies on the openpyxl package, so install both (pip install pandas openpyxl).

  2. glob: This module is new to our series, and it's incredibly useful. It ships with Python's standard library, so there is nothing extra to install. Its job is to find files on your computer that match a specific pattern. Think of it as a file search bar for your Python script.

The Step-by-Step Code Solution

Let's assume you have a folder named monthly_reports on your computer, and inside it are your Excel files (Jan.xlsx, Feb.xlsx, etc.).

Step 1: Import the Libraries

First, we need to import pandas to work with the data and glob to find the files.

Python
import pandas as pd
import glob

Step 2: Find All the Excel Files

Next, we'll use glob to create a list of all the Excel files in our target folder. The * is a wildcard that means "match anything." So, *.xlsx means "match any file that ends with .xlsx".

Python
# Define the path to the folder
path = "C:/Users/YourUser/Documents/monthly_reports"

# Use glob to get a list of all excel files in the folder
excel_files = glob.glob(path + "/*.xlsx")

print(excel_files)
# Output might look like:
# ['C:/Users/YourUser/Documents/monthly_reports\\Jan.xlsx', 'C:/Users/YourUser/Documents/monthly_reports\\Feb.xlsx', ...]
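
Two practical details are worth knowing about this step. glob returns matches in no guaranteed order, and while a workbook is open Excel keeps a temporary lock file next to it whose name starts with ~$ (for example ~$Jan.xlsx), which the *.xlsx pattern also matches. A small sketch that handles both, using a hypothetical helper named list_excel_files:

```python
import glob
import os


def list_excel_files(folder, pattern="*.xlsx"):
    """Return matching paths in sorted order, skipping Excel lock files (~$...)."""
    matches = glob.glob(os.path.join(folder, pattern))
    return sorted(
        p for p in matches
        if not os.path.basename(p).startswith("~$")
    )
```

Sorting here is alphabetical, so Feb.xlsx comes before Jan.xlsx. That gives a repeatable row order in the merged file, not a chronological one.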

Step 3: Loop, Read, and Collect

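Now that we have a list of all our file paths, we'll create an empty list to hold our data. Then, we'll loop through each file path, read the Excel file into a pandas DataFrame, and add that DataFrame to our list. Note that pd.read_excel() reads only the first worksheet of each file by default; pass the sheet_name argument if your data lives on a different sheet.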

Python
# Create an empty list to store the individual DataFrames
all_data = []

for file in excel_files:
    # Read each Excel file into a DataFrame
    df = pd.read_excel(file)
    # Add the DataFrame to our list
    all_data.append(df)
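
Once the rows are stacked together, it is no longer obvious which file each one came from. One possible variation on the loop, sketched here under assumptions, tags every row with its file name. The function name read_with_source and the column name source_file are invented for this example, and the reader parameter exists only so the sketch can be tried with other formats.

```python
import os

import pandas as pd


def read_with_source(paths, reader=pd.read_excel):
    """Read each path with `reader` and tag its rows with the originating file name."""
    frames = []
    for path in paths:
        df = reader(path)
        # "source_file" is a hypothetical column name chosen for this sketch
        frames.append(df.assign(source_file=os.path.basename(path)))
    return frames
```

Calling all_data = read_with_source(excel_files) would then stand in for the loop above, and the merged report gains one extra column.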

Step 4: Combine into One Master DataFrame

This is the magic step. We use the pd.concat() function to take our list of individual DataFrames (all_data) and stack them on top of each other into one single, master DataFrame.

Python
# Concatenate all the DataFrames in the list into one
master_df = pd.concat(all_data, ignore_index=True)

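The ignore_index=True part is important. Without it, every file's rows keep their original row numbers, so the index would restart at 0 for each file; with it, the master DataFrame gets one continuous index. Also note that pd.concat() raises a ValueError if all_data is empty, so make sure glob actually found some files.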
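
To see what the index looks like with and without ignore_index, here is a small self-contained illustration with two made-up DataFrames standing in for two monthly files:

```python
import pandas as pd

# Illustrative stand-ins for two monthly files
jan = pd.DataFrame({"Product": ["A", "B"], "Amount": [10, 20]})
feb = pd.DataFrame({"Product": ["A", "B"], "Amount": [30, 40]})

kept = pd.concat([jan, feb])                           # original row labels kept
renumbered = pd.concat([jan, feb], ignore_index=True)  # fresh 0..n-1 index

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```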

Step 5: Save the Master File

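Finally, as we learned in our post on [Why Your Python Script Isn't Updating Excel Cells], we must save our new in-memory DataFrame back to a file on our disk. Because we pass only a file name, the report is written to the folder the script is run from. Keep that location separate from monthly_reports, or the next run will pick up the master file as one of its inputs.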

Python
# Save the master DataFrame to a new Excel file
master_df.to_excel("Master_Report_Q1.xlsx", index=False)

print("All Excel files have been merged successfully into Master_Report_Q1.xlsx!")

The Complete Script

Here is the entire script. It works the same whether the folder holds three files or three hundred; the run time simply grows with the number and size of the files.

Python
import pandas as pd
import glob

# Define the path to the folder containing the Excel files
path = "C:/Users/YourUser/Documents/monthly_reports"

# Use glob to get a list of all excel files
excel_files = glob.glob(path + "/*.xlsx")

# Create an empty list to store DataFrames
all_data = []

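# Loop through the list of files and read each one into a DataFrame
for file in excel_files:
    df = pd.read_excel(file)
    all_data.append(df)

# Stop early if no files matched; pd.concat raises ValueError on an empty list
if not all_data:
    raise SystemExit(f"No .xlsx files found in {path}")

# Concatenate all DataFrames into one
master_df = pd.concat(all_data, ignore_index=True)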

# Save the merged DataFrame to a new Excel file
master_df.to_excel("Master_Report_Q1.xlsx", index=False)

print("Merge complete! Master report saved.")

Frequently Asked Questions (FAQs)

1. What if I only want to merge files that start with "Sales_"?

You can make your glob pattern more specific. For example: glob.glob(path + "/Sales_*.xlsx").

2. What if my files are in different formats, like CSV and Excel?

You would need to handle this with more advanced logic, perhaps running glob twice with different patterns (*.xlsx and *.csv) and using the appropriate read function (pd.read_excel or pd.read_csv) for each.
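
One way to sketch that logic is to map each file extension to the matching pandas reader instead of running glob twice. The names READERS and read_mixed_folder below are made up for this example, and the sketch assumes the CSV and Excel files share the same columns.

```python
import glob
import os

import pandas as pd

# Map each supported file extension to the pandas function that reads it
READERS = {".xlsx": pd.read_excel, ".csv": pd.read_csv}


def read_mixed_folder(folder):
    """Read every .xlsx and .csv file in `folder` into a list of DataFrames."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*"))):
        ext = os.path.splitext(path)[1].lower()
        reader = READERS.get(ext)
        if reader is None:
            continue  # skip files of any other type
        frames.append(reader(path))
    return frames
```

The resulting list can be passed to pd.concat() exactly like all_data in the main script.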

3. What happens if the columns are in a different order in some files?

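pd.concat() is smart! It aligns columns by name, not by position, so as long as the column names are the same, the data lines up correctly even if the column order differs between files. The match must be exact, though: if a header differs in spelling, capitalization, or stray spaces (for example "amount" instead of "Amount"), pandas treats it as a separate column and fills the gaps with NaN.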
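
A quick way to convince yourself is to try it on small made-up DataFrames. The sketch below shows name-based alignment and also what happens when a header is spelled differently:

```python
import pandas as pd

# Same columns, different order (illustrative data only)
jan = pd.DataFrame({"Date": ["2024-01-05"], "Product": ["Widget"], "Amount": [100]})
feb = pd.DataFrame({"Amount": [120], "Date": ["2024-02-05"], "Product": ["Widget"]})

merged = pd.concat([jan, feb], ignore_index=True)
print(merged.shape)            # (2, 3)
print(list(merged["Amount"]))  # [100, 120]

# A header that differs only by case is treated as a separate column
mar = pd.DataFrame({"Date": ["2024-03-05"], "Product": ["Widget"], "amount": [90]})
mixed = pd.concat([jan, mar], ignore_index=True)
print("amount" in mixed.columns)        # True
print(mixed["Amount"].isna().tolist())  # [False, True]
```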

Conclusion: You've Built a Data Pipeline

You've just learned one of the most practical and time-saving skills in data automation. By combining glob to find files and pandas to merge them, you've created a simple but powerful data pipeline. This script can be run again and again, saving you countless hours and ensuring your data consolidation is always fast, accurate, and effortless.
