— 13 min read

Generating an excel report with python

Pandas pivot tables to an excel sheet.

For many data analysts and business people excel is a powerful tool for reporting. But very often excel reports become cumbersome and difficult to extend, especially when it comes to gathering data from several sources. In this post we will generate an excel report using python (pandas and openpyxl). We will also look briefly at a way to create an excel template and then generate an excel report with several sheets based on this template.

We will be looking at some startups investments data.

1 import pandas as pd
2 import numpy as np
3 
4 df = pd.read_csv('investments.csv', sep=';')

To make the report simple, we will be using only a subset of the columns in the csv file, and we will look at the data from 2007 for funding rounds of type: angel, seed, venture, private equity, and undisclosed.

1 df1 = df[['funded_year', 'funded_quarter', 'funding_round_type', 'raised_amount_usd']]
2 df1 = df1[df.funded_year >= 2007]
3 df1 = df1[df1.funding_round_type.isin(['angel', 'private_equity', 'seed', 'venture', 'undisclosed'])]

The data look like this:

funded_year funded_quarter funding_round_type raised_amount_usd
2014 2014-Q3 venture 5 956 174
2014 2014-Q3 private_equity 81 216 295
2014 2014-Q3 seed 1 300 000
2014 2014-Q3 seed NaN
2014 2014-Q2 seed 500 000
2014 2014-Q3 venture 2 500 000
2014 2014-Q3 venture 7 000 000

We notice that the data contains some undefined values, so we have to do some cleaning

1 df1 = df1.fillna(0)
2 def to_int(x):
3     try:
4         x = int(x.replace(' ', ''))
5     except (AttributeError, ValueError):
6         return 0
7 
8 df1.raised_amount_usd = df1.raised_amount_usd.map(to_int)
9 df1.raised_amount_usd = df1.raised_amount_usd.astype(int)

Now let’s create our pivot tables

1 table1 = pd.pivot_table(df1, index=['funding_round_type'],
2                columns=['funded_year', 'funded_quarter'],
3                values=['raised_amount_usd'],
4                aggfunc=[lambda x: len(x)],fill_value=0)
5 
6 table2 = pd.pivot_table(df1, index=['funding_round_type'],
7                columns=['funded_year', 'funded_quarter'],
8                values=['raised_amount_usd'],
9                aggfunc=[np.sum],fill_value=0)

This will generate a 2 pivot tables similar to this:

<count>
raised_amount_usd
funded_year 2007 2008 ... 2013 2014
funded_quarter 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 ... 2013-Q2 2013-Q3 2013-Q4 2014-Q1 2014-Q2 2014-Q3
funding_round_type
angel 97 67 50 71 122 69 ... 161 226 141 211 123 77
private_equity 16 14 22 10 13 13 ... 244 187 132 101 131 68
seed 110 88 92 111 219 143 ... 1870 2368 2097 1922 1686 1763
undisclosed 395 268 255 243 362 354 ... 710 454 647 830 304 98
venture 1294 1207 1297 1144 1440 1434 ... 1856 2065 2147 2263 3019 2905

5 rows × 32 columns

Now let’s generate an excel report with 2 sheets containing the results of the pivot tables

1 writer = pd.ExcelWriter('report.xlsx')
2 table1.to_excel(writer, 'Sheet1')
3 table2.to_excel(writer, 'Sheet2')
4 writer.save()

The result output

excel_report

To go a bit further, let’s imagine that we want to generate the report from a template. For this we will need the python module openpyxl, because it’s the only python module that allows to modify an excel file.

The first step is to create a template for our report.

excel_report

You can notice that we named an excel range values to be able to update the values in that range easily from the python side. Here’s an utility function to update an excel sheet from python given a named range or cell range.

 1 import openpyxl
 2 from openpyxl import load_workbook
 3 
 4 def update_range(worksheet, data, cell_range=None, named_range=None):
 5     """
 6     Updates an excel worksheet with the given data.
 7     :param worksheet: an excel worksheet
 8     :param data: data used to update the worksheet cell range (list, tuple, np.ndarray, pd.Dataframe)
 9     :param cell_range: a string representing the cell range, e.g. 'AB12:XX23'
10     :param named_range: a string representing an excel named range
11     """
12 
13     def clean_data(data):
14         if not isinstance(data, (list, tuple, np.ndarray, pd.DataFrame)):
15             raise TypeError('Invalid data, data should be an array type iterable.')
16 
17         if not len(data):
18             raise ValueError('You need to provide data to update the cells')
19 
20         if isinstance(data, pd.DataFrame):
21             data = data.values
22 
23         elif isinstance(data, (list, tuple)):
24             data = np.array(data)
25 
26         return np.hstack(data)
27 
28     def clean_cells(worksheet, cell_range, named_range):
29         # check that we can access a cell range
30         if not any((cell_range, named_range) or all((cell_range, named_range))):
31             raise ValueError('`cell_range` or `named_range` should be provided.')
32 
33         # get the cell range
34         if cell_range:
35             try:
36                 cells = np.hstack(worksheet[cell_range])
37             except (CellCoordinatesException, AttributeError):
38                 raise ValueError('The cell range provided is invalid, cell range must be in the form XX--[:YY--]')
39 
40         else:
41             try:
42                 cells = worksheet.get_named_range(named_range)
43             except (NamedRangeException, TypeError):
44                 raise ValueError('The current worksheet {} does not contain any named range {}.'.format(
45                     worksheet.title,
46                     named_range))
47 
48         # checking that we have cells to update, and data
49         if not len(cells):
50             raise ValueError('You need to provide cells to update.')
51 
52         return cells
53 
54     cells = clean_cells(worksheet, cell_range, named_range)
55     data = clean_data(data)
56 
57     # check that the data has the same dimension as cells
58     if len(cells) != data.size:
59         raise ValueError('Cells({}) should have the same dimension as the data({}).'.format(len(cells), data.size))
60 
61     for i, cell in enumerate(cells):
62         cell.value = data[i]

Now we are ready to generate our first report from the template

 1 import openpyxl
 2 from openpyxl import load_workbook
 3 
 4 # load the excel template
 5 workbook = load_workbook('template.xlsx')
 6 
 7 # number of funding rounds by year
 8 table1 = pd.pivot_table(df1, index=['funding_round_type'],
 9                columns=['funded_year'],
10                values=['raised_amount_usd'],
11                aggfunc=[lambda x: len(x)], fill_value=0)
12 
13 worksheet = workbook.active
14 worksheet.title = 'count_by_year'
15 update_range(workbook.active, table1.values, named_range='values')
16 # create a new excel report
17 workbook.save('report1.xlsx')

This will keep our template unmodified and generate a report with one excel sheet count_by_year containing the values from the pivot table.

excel_report

The last part of this part is to show a work around when we need to generate a report with multiple sheets based on the same template. Things get a bit complicated, because no python module allow to copy/clone a sheet and update it, I managed to find a way to do this by reloading the workbook each time we copy the template sheet, but I hope that this option will be supported by the openpyxl module in the future, for further reference there’s a ticket for this.

 1 import openpyxl
 2 from openpyxl import load_workbook
 3 from copy import copy
 4 
 5 # load the excel template
 6 workbook = load_workbook('template.xlsx')
 7 # create a new excel report
 8 workbook.save('report2.xlsx')
 9 
10 
11 # number of funding rounds by year
12 table1 = pd.pivot_table(df1, index=['funding_round_type'],
13                columns=['funded_year'],
14                values=['raised_amount_usd'],
15                aggfunc=[lambda x: len(x)], fill_value=0)
16 
17 # sum of funding rounds by year
18 table2 = pd.pivot_table(df1, index=['funding_round_type'],
19               columns=['funded_year'],
20               values=['raised_amount_usd'],
21               aggfunc=[np.sum], fill_value=0)
22 
23 def create_sheet(wb_name, sheet_name, values):
24     workbook = load_workbook(wb_name)
25     worksheet = workbook.active
26     # clone the template worksheet
27     update_range(workbook.active, values, named_range='values')
28     # add worksheet to the workbook
29     worksheet.title = sheet_name
30     workbook._add_sheet(copy.copy(worksheet))
31     # since we can't use deepcopy, we need to save the workbook and reload it
32     # otherwise we will be updating the same references of all copied sheets
33     workbook.save(wb_name)
34 
35 create_sheet('report2.xlsx', 'count_by_year', table1.values)
36 create_sheet('report2.xlsx', 'sum_by_year', table2.values)
37 # we need to remove the current template sheet
38 workbook = load_workbook('report2.xlsx')
39 workbook.remove_sheet(workbook.active)
40 workbook.save('report2.xlsx')

The output result

excel_report