Data Aggregation: 5 to 30/60 minute

Prior to 2018, the 30- and 60-minute PSM3 data reflected “instantaneous” values for nominal 4 km pixels. Starting in 2018, when higher-resolution (5-minute, 2 km) GOES imagery became available, the 30- and 60-minute values switched to being spatial and temporal averages of the higher-resolution 5-minute data.

This notebook recreates the aggregation performed to calculate the 30- and 60-minute values from the 5-minute values.

Starting with the 30-minute values, the spatial average is easy: each 4 km pixel comprises four 2 km pixels, so get data for the four constituent pixels and average them together. The temporal average is slightly more complicated: because each 30-minute value is a symmetrical average of seven 5-minute values (the value at the timestamp plus the three values on either side), the 5-minute values at the window edges contribute to two separate interval averages. Here’s an ASCII illustration of this concept:

Timestamps: 00 05 10 15 20 25 30 35 40 45 50 55 00 05 10 15 20 25
Interval 1:          --------------------
Interval 2:                            --------------------

The 5-minute value at minute 45 gets incorporated into the average for both 30-minute intervals.

Note that this method does not perfectly recreate the 30-minute values, and I’m not sure that this is actually what happens in the real NSRDB. But it does get pretty close.
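To make the overlapping windows from the illustration concrete, here is a small sketch using synthetic timestamps only (no NSRDB data). The helper `window_members` is a hypothetical name introduced just for this example, and the ±15 minute half-width is my reading of the "seven 5-minute values" description above.

```python
import pandas as pd

# Synthetic 5-minute timestamps from 00:00 through 01:25,
# mirroring the "Timestamps" row of the ASCII illustration.
times = pd.date_range('2018-01-01 00:00', periods=18, freq='5min')

def window_members(center, half_width='15min'):
    """Return the 5-minute timestamps within +/- half_width of `center`."""
    center = pd.Timestamp(center)
    delta = pd.Timedelta(half_width)
    return times[(times >= center - delta) & (times <= center + delta)]

interval_1 = window_members('2018-01-01 00:30')  # 00:15 through 00:45
interval_2 = window_members('2018-01-01 01:00')  # 00:45 through 01:15

print(len(interval_1), len(interval_2))       # 7 7
print(interval_1.intersection(interval_2))    # the single shared timestamp, 00:45
```

Each window holds seven samples, and the only timestamp the two windows share is the one at minute 45.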

[1]:
import pvlib
import matplotlib.pyplot as plt
import pandas as pd
[2]:
lat, lon = 40.005, -80.005

# Be aware that `get_psm3()` uses a different endpoint for interval=5 than it does for interval=30 or 60

kwargs = dict(latitude=lat, longitude=lon, api_key='DEMO_KEY', email='assessingsolar@gmail.com',
              names='2018', map_variables=False, leap_day=True)
[3]:
df_4km30, meta = pvlib.iotools.get_psm3(interval=30, **kwargs)
print('4 km pixel coords:', meta['Latitude'], meta['Longitude'])  # 40.01 -80.02
4 km pixel coords: 40.01 -80.02
[4]:
df_2km1, meta = pvlib.iotools.get_psm3(interval=5, **kwargs)
print('2 km pixel coords:', meta['Latitude'], meta['Longitude'])  # 40.00 -80.01
2 km pixel coords: 40.0 -80.01

So the center of the outer 4 km pixel is north and west of the center of the inner 2 km pixel. That means the other three inner 2 km pixels in this 4 km pixel are north, west, and north-west of this one. Since the 2 km pixels are 0.02 degrees on a side, stepping 0.02 degrees in those directions will get us the neighboring pixels:

[5]:
kwargs['latitude'] = lat
kwargs['longitude'] = lon - 0.02  # one pixel west
df_2km2, meta = pvlib.iotools.get_psm3(interval=5, **kwargs)
print('2 km pixel coords:', meta['Latitude'], meta['Longitude'])  # 40.00 -80.03
2 km pixel coords: 40.0 -80.03
[6]:
kwargs['latitude'] = lat + 0.02  # one pixel north
kwargs['longitude'] = lon
df_2km3, meta = pvlib.iotools.get_psm3(interval=5, **kwargs)
print('2 km pixel coords:', meta['Latitude'], meta['Longitude'])  # 40.02 -80.01
2 km pixel coords: 40.02 -80.01
[7]:
kwargs['latitude'] = lat + 0.02  # one pixel north-west
kwargs['longitude'] = lon - 0.02
df_2km4, meta = pvlib.iotools.get_psm3(interval=5, **kwargs)
print('2 km pixel coords:', meta['Latitude'], meta['Longitude'])  # 40.02 -80.03
2 km pixel coords: 40.02 -80.03
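As a sanity check on the four pixel centers reported above, the same coordinates can be derived from the 4 km pixel center by arithmetic alone. This is only a sketch: it assumes each 4 km pixel splits evenly into a 2x2 block of 2 km pixels, which is my assumption about the grid layout rather than something taken from NSRDB documentation.

```python
# Assumption: the 2 km pixel centers sit half a 2 km pixel width (0.01 degrees)
# away from the 4 km pixel center in each direction.
outer_lat, outer_lon = 40.01, -80.02   # 4 km pixel center reported earlier
half_step = 0.01                        # half of the 0.02 degree pixel width

inner_centers = [
    (round(outer_lat + dlat, 2), round(outer_lon + dlon, 2))
    for dlat in (-half_step, half_step)   # south row first, then north row
    for dlon in (half_step, -half_step)   # east column first, then west column
]
print(inner_centers)
# [(40.0, -80.01), (40.0, -80.03), (40.02, -80.01), (40.02, -80.03)]
```

These match the coordinates returned in the metadata of the four 5-minute requests, in the same order.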

Now, implement the spatiotemporal aggregation method described above. I think this is about as clean as it gets when using pandas, but please let me know if there’s a better way.

[8]:
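def aggregate30(x1, x2, x3, x4):
    # spatial average of the four 2 km pixels at each 5-minute timestamp
    avg_5min = pd.concat([x1, x2, x3, x4], axis=1).mean(axis=1)
    # shift by 3 samples (15 min) so each 30-minute bin holds the first six
    # samples of the window centered on the bin label
    resampler = avg_5min.shift(3).resample('30min')
    sum_30min_6samples = resampler.sum()
    # the seventh sample is the first sample of the following bin
    avg_30min_7samples = (sum_30min_6samples + resampler.first().shift(-1)) / 7
    return avg_30min_7samples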
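One possible alternative, sketched here on synthetic data rather than the real PSM3 series, is a centered 7-sample rolling mean followed by keeping only the rows at minutes :00 and :30. I have not checked this against the NSRDB values; the sketch only confirms that it agrees with the shift-and-resample approach wherever both have a full window.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute series for one day, standing in for the spatial average.
index = pd.date_range('2018-01-01', periods=288, freq='5min')
values = np.random.default_rng(0).uniform(0, 1000, len(index))
avg_5min = pd.Series(values, index=index)

# Shift-and-resample approach: six samples per bin plus the first sample
# of the following bin, divided by seven.
resampler = avg_5min.shift(3).resample('30min')
shift_based = (resampler.sum() + resampler.first().shift(-1)) / 7

# Alternative: centered 7-sample rolling mean, keeping the :00 and :30 rows.
rolling = avg_5min.rolling(7, center=True).mean()
rolling_based = rolling[rolling.index.minute % 30 == 0]

# Compare only timestamps where both approaches produce a value; the first
# and last intervals of the day lack a full window and drop out here.
both = pd.concat([shift_based, rolling_based], axis=1, keys=['shift', 'rolling']).dropna()
print(np.allclose(both['shift'], both['rolling']))  # True
```

The two differ at the edges of the series: the rolling mean returns NaN for an incomplete window, while the shift-based sum silently averages fewer samples in the very first interval.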
[9]:
columns = ['GHI', 'DNI', 'DHI']
fig, axes = plt.subplots(2, len(columns), figsize=(9, 6))

for i, column in enumerate(columns):
    x = df_4km30[column]
    y = aggregate30(df_2km1[column], df_2km2[column], df_2km3[column], df_2km4[column])
    axes[0, i].scatter(x, y, s=1)
    axes[0, i].set_xlabel(f'PSM3 {column} [W/m2]')
    axes[0, i].set_ylabel(f'Calculated {column} [W/m2]')

    axes[1, i].scatter(x, x - y, s=1)
    axes[1, i].set_ylabel(f'{column} Difference (PSM3 $-$ Calculated) [W/m2]')

fig.tight_layout()
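[Figure: scatter plots of PSM3 30-minute GHI, DNI, and DHI against the values calculated from the 5-minute data (top row) and the corresponding differences (bottom row)]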

The 60-minute values are dead simple: they’re just every other 30-minute value. Therefore, they reflect the average only over the 30 minutes centered on the timestamp, not the entire hour. Here’s an ASCII illustration:

Timestamps: 10 15 20 25 30 35 40 45 50 55 00 05 10 15 20 25 30 35 40 45 50
Interval 1:    --------------------
Interval 2:                                        --------------------

Other than simplicity and not having to store a separate dataset, it’s not clear to me why this is a good idea – I’d have thought that extending the above approach (average of thirteen 5-minute values) would make more sense.
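For comparison, here is what the thirteen-sample alternative could look like next to the seven-sample window. This is purely hypothetical and uses synthetic data: the variable names are made up for the example, and nothing here is taken from the NSRDB processing chain.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute series for one day; not real NSRDB data.
index = pd.date_range('2018-01-01', periods=288, freq='5min')
values = np.random.default_rng(1).uniform(0, 1000, len(index))
ghi_5min = pd.Series(values, index=index)

# The 60-minute timestamps fall on the half hour.
at_half_hour = ghi_5min.index.minute == 30

# 7-sample (30-minute) window centered on :30, as in the illustration.
avg_7 = ghi_5min.rolling(7, center=True).mean()[at_half_hour]

# Hypothetical alternative: 13-sample (60-minute) window centered on :30.
avg_13 = ghi_5min.rolling(13, center=True).mean()[at_half_hour]

print(pd.DataFrame({'7-sample': avg_7, '13-sample': avg_13}).head())
```

With random input the two columns differ noticeably; for real irradiance the gap would depend on how quickly conditions change within the hour.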

Regardless, let’s show that the 60-minute values really are every other 30-minute value:

[10]:
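kwargs['latitude'], kwargs['longitude'] = lat, lon  # reset the coordinates modified in the cells above
df_4km60, meta = pvlib.iotools.get_psm3(interval=60, **kwargs)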
print('4km cell coords:', meta['Latitude'], meta['Longitude'])  # 40.01 -80.02
4km cell coords: 40.01 -80.02
[11]:
columns = ['GHI', 'DNI', 'DHI']
fig, axes = plt.subplots(2, len(columns), figsize=(9, 6))

for i, column in enumerate(columns):
    x = df_4km60[column]
    y = df_4km30[column][1::2]  # take every other 30-minute value
    axes[0, i].scatter(x, y, s=1)
    axes[0, i].set_xlabel(f'PSM3 {column} [W/m2]')
    axes[0, i].set_ylabel(f'Calculated {column} [W/m2]')

    axes[1, i].scatter(x, x - y, s=1)
    axes[1, i].set_ylabel(f'{column} Difference (PSM3 60 $-$ PSM3 30) [W/m2]')

fig.tight_layout()
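[Figure: scatter plots of PSM3 60-minute GHI, DNI, and DHI against every other PSM3 30-minute value (top row) and the corresponding differences (bottom row)]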
[12]:
%load_ext watermark
%watermark --iversions -u -d -t
Last updated: 2022-09-22 21:27:07

pvlib     : 0.9.3
pandas    : 1.5.0
matplotlib: 3.5.2
