Page 1 of 1

Using Python and OPeNDAP to subset large gridded datasets

PostPosted: Mon Oct 06, 2014 4:44 pm
by edward.m.armstrong
*** Please note: the updated version of this python script is available in the following data recipe [Using Python to subset large gridded datasets (v2)] ***

Dealing with large gridded datasets is challenging from the perspective of storage and access. Instead of downloading and storing large datasets locally it is often preferable to spatially subset them on a server side and store smaller pieces, or even use the pieces dynamically for an analysis and discard them. OPeNDAP is a popular service to extract subsets from data in netCDF and HDF.

Many GHRSST Level 4 datasets fit this characteristic of large volumes and the PO.DAAC OPeNDAP server (http://podaac-opendap.jpl.nasa.gov/opendap/allData/) can be used in a subsetting framework to extract subsets as needed from these granules. But OPeNDAP requires a granule (file) specific URL service call including variables and their proper grid index range to correctly perform the subsetting operation. This can be a tedious operation.

The python script opendayL4.py (ftp://podaac.jpl.nasa.gov/allData/commo ... endapL4.py) is wrapper to OPeNDAP subsetting requests for GHRSST Level 4 datasets. In its current iteration it is “tuned” to the MUR dataset (http://podaac.jpl.nasa.gov/dataset/JPL-L4UHfnd-GLOB-MUR), but can be easily modified for others. For example, to extract MUR data from 2010-01-01 to 2010-01-05 for a region bounded by -140 to -110 degrees longitude and 10 to 50 degrees latitude, this request can be made on the UNIX command line:

% opendapL4.py --start 2010 01 01 --finish 2010 01 05 --region -140 -110 10 50

The program will first determine if it can find granules for the temporal period chosen and report the subsetted output grid dimensions. It will ask for download confirmation:

~~~~ example of screen output ~~~~~~~

First file: 20100101-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc
Last file: 20100105-JPL-L4UHfnd-GLOB-v01-fv04-MUR.nc
files obtained at 1-day interval

Longitude range: -140.000000 to -110.000000
Latitude range: 10.000000 to 50.000000
every 1 pixel(s) is obtained

grid dimensions will be ( 2731 x 3641 )

OK to download? [yes or no]: yes

~~~~~~ end example ~~~~~~~~

Once the request is accepted, the subsetted granules are downloaded to the local directory via a curl or wget system call. The output format is netCDF.

The program has dependencies on the python netCDF library and the aforementioned curl and wget system commands.

To modify this program for other datasets including non-GHRSST pay careful attention to parameters() subroutine and the root path defined by variable ‘L4URL’

Re: Using Python and OPeNDAP to subset large gridded dataset

PostPosted: Mon Dec 08, 2014 3:03 pm
by rleal
Hi:

How could this script be tuned to accept also hdf data (and maybe output netcdf subsets)? For instance http://podaac-opendap.jpl.nasa.gov/open ... /chlA/4km/ files are stored in hdf format and I believe the wget/cURL subsetting syntax must be quite different.

Thanks for the post.

Ricardo

Re: Using Python and OPeNDAP to subset large gridded dataset

PostPosted: Wed May 13, 2015 10:26 am
by edward.m.armstrong
Hi Ricardo,

Its not currently usable for HDF4 data like the MODIS CHL A you would like to subset. The netCDF and HDF libraries are different.

Best to use OPeNDAP or our THREDDS servers to access and subset these.

Re: Using Python and OPeNDAP to subset large gridded dataset

PostPosted: Sun Jun 26, 2016 10:39 pm
by mariakatosvich
In this example i will show how to benefit from OPeNDAP subsetting. We will request subsets from worldwide digital elevevation maps (DEMs) that are too big to request as a whole: resulting in approx. 500 MB (over 10,000 x 20,000 points) or 2 GB (over 20,000 x 40,000) for respectively a 1 minute or 30 second grids.

Code: Select all
#!/usr/bin/env python
 
#OPENDAP_SUBSETTING_WITH_PYTHON_TUTORIAL how to benefit from OPeNDAP subsetting in python
#
# This tutorial is also available for Matlab
#
#See also: OPeNDAP_access_with_Matlab_tutorial.py
 
# $Id:  $
# $Date:  $
# $Author:  $
# $Revision:  $
# $HeadURL: https://repos.deltares.nl/repos/OpenEarthTools/trunk/python/io/opendap/OPeNDAP_subsetting_with_python_tutorial.py $
# $Keywords: $
 
# This document is also posted on a wiki: http://public.deltares.nl/display/OET/OPeNDAP+access+with+python
 
# Read data from an opendap server
import urllib
import numpy as np
import netCDF4
import pydap
import matplotlib
import matplotlib.pyplot as plt
import pylab
 
# from opendap import opendap # OpenEarthTools module, see above that makes pypdap quack like netCDF4
 
# Define data on an opendap server
# for converting non-open access GECBO grids to same netCDF structure: see https://repos.deltares.nl/repos/OpenEarthRawData/trunk/gebco/
gridurls =  ['http://geoport.whoi.edu/thredds/dodsC/bathy/etopo1_bed_g2',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/etopo2_v2c.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/srtm30plus_v1.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/srtm30plus_v6',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/smith_sandwell_v9.1.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/smith_sandwell_v11',
             r'F:\checkouts\OpenEarthRawData\gebco\raw\gebco_30sec.nc',
             r'F:\checkouts\OpenEarthRawData\gebco\raw\gebco_1min.nc']
 
lineurl  = 'http://opendap.deltares.nl/thredds/dodsC/opendap/noaa/gshhs/gshhs_i.nc';
 
# Get line data: 1D vectors are small, so we can get all data
linedata = netCDF4.Dataset(lineurl) # opendap(url_line) # when netCDF4 was not compiled with OPeNDAP
 
lines = dict(
   lon=linedata.variables['lon'][:],
   lat=linedata.variables['lat'][:]
   )
linedata.close()
 
# Define bounding box
BB = dict(
   lon=[ 0, 10],
   lat=[50, 55]
   )
 
for i,ncfile in enumerate(gridurls):
 
   print ncfile
 
   # Get grid data
   grid   = netCDF4.Dataset(ncfile) # opendap(ncfile) # when netCDF4 was not compiled with OPeNDAP
 
   # Get all data from lat and lon, but just a pointer to the topography
   # Because getting it all takes a bit too long
   lat = grid.variables['lat'][:]
   lon = grid.variables['lon'][:]
 
   # nonzero returns a tuple of idx per dimension
   # we're unpacking the tuple here so we can lookup max and min
   (latidx,) = np.logical_and(lat >= BB['lat'][0], lat < BB['lat'][1]).nonzero()
   (lonidx,) = np.logical_and(lon >= BB['lon'][0], lon < BB['lon'][1]).nonzero()
 
   #
   assert lat.shape[0] == grid.variables['topo'].shape[0], 'We assume first dim is lat here'
   assert lon.shape[0] == grid.variables['topo'].shape[1], 'We assume 2nd dim is lon here'
 
   # get rid of the non used lat/lon now
   lat = lat[latidx]
   lon = lon[lonidx]
   # Get the topography data with the new indices
   topo = grid.variables['topo'][latidx.min():latidx.max(),
                                 lonidx.min():lonidx.max()]
   title = grid.title # remember so we can close the file before plotting
   grid.close()
 
   Lon,Lat = np.meshgrid(lon, lat)
   # make a pcolormesh (the fastest way to plot simple grids)
   mesh = plt.pcolormesh(Lon,Lat,topo)
   plt.plot(lines['lon'],lines['lat'],'k')
   # some customizations (called on the axes and figure)
   mesh.axes.set_title(title)
   mesh.axes.set_aspect(1/np.cos(np.pi*np.mean(BB['lat'])/180.)) # set aspect to match meters
   plt.xlim(BB['lon'][0],BB['lon'][1])
   plt.ylim(BB['lat'][0],BB['lat'][1])
   mesh.axes.set_xlabel('lon [deg]')
   mesh.axes.set_ylabel('lat [deg]')
   plt.clim(-50,150)
   mesh.figure.colorbar(mesh) # use the mesh to make a colorbar
   # Save the figure
   mesh.figure.savefig('%s.png' % title)
   mesh.figure.clf()

Thanks
Maria.K

Re: Using Python and OPeNDAP to subset large gridded dataset

PostPosted: Tue Jun 28, 2016 1:24 pm
by yiboj
Hi Maria,

Thanks for sharing the code with us using pyDap module. I have removed the lines with local directory in your code and reposted here, so that other users could run the code directly without any modifications.

Code: Select all
#OPENDAP_SUBSETTING_WITH_PYTHON_TUTORIAL how to benefit from OPeNDAP subsetting in python
#
# This tutorial is also available for Matlab
#
#See also: OPeNDAP_access_with_Matlab_tutorial.py
 
# $Id:  $
# $Date:  $
# $Author:  $
# $Revision:  $
# $HeadURL: https://repos.deltares.nl/repos/OpenEarthTools/trunk/python/io/opendap/OPeNDAP_subsetting_with_python_tutorial.py $
# $Keywords: $
 
# This document is also posted on a wiki: http://public.deltares.nl/display/OET/OPeNDAP+access+with+python
 
# Read data from an opendap server
import urllib
import numpy as np
import netCDF4
import pydap
import matplotlib
import matplotlib.pyplot as plt
import pylab
 
# from opendap import opendap # OpenEarthTools module, see above that makes pypdap quack like netCDF4
 
# Define data on an opendap server
# for converting non-open access GECBO grids to same netCDF structure: see https://repos.deltares.nl/repos/OpenEarthRawData/trunk/gebco/
gridurls =  ['http://geoport.whoi.edu/thredds/dodsC/bathy/etopo1_bed_g2',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/etopo2_v2c.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/srtm30plus_v1.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/srtm30plus_v6',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/smith_sandwell_v9.1.nc',
             'http://geoport.whoi.edu/thredds/dodsC/bathy/smith_sandwell_v11']
 
lineurl  = 'http://opendap.deltares.nl/thredds/dodsC/opendap/noaa/gshhs/gshhs_i.nc';
 
# Get line data: 1D vectors are small, so we can get all data
linedata = netCDF4.Dataset(lineurl) # opendap(url_line) # when netCDF4 was not compiled with OPeNDAP
 
lines = dict(
   lon=linedata.variables['lon'][:],
   lat=linedata.variables['lat'][:]
   )
linedata.close()
 
# Define bounding box
BB = dict(
   lon=[ 0, 10],
   lat=[50, 55]
   )
 
for i,ncfile in enumerate(gridurls):
 
   print ncfile
 
   # Get grid data
   grid   = netCDF4.Dataset(ncfile) # opendap(ncfile) # when netCDF4 was not compiled with OPeNDAP
 
   # Get all data from lat and lon, but just a pointer to the topography
   # Because getting it all takes a bit too long
   lat = grid.variables['lat'][:]
   lon = grid.variables['lon'][:]
 
   # nonzero returns a tuple of idx per dimension
   # we're unpacking the tuple here so we can lookup max and min
   (latidx,) = np.logical_and(lat >= BB['lat'][0], lat < BB['lat'][1]).nonzero()
   (lonidx,) = np.logical_and(lon >= BB['lon'][0], lon < BB['lon'][1]).nonzero()
 
   #
   assert lat.shape[0] == grid.variables['topo'].shape[0], 'We assume first dim is lat here'
   assert lon.shape[0] == grid.variables['topo'].shape[1], 'We assume 2nd dim is lon here'
 
   # get rid of the non used lat/lon now
   lat = lat[latidx]
   lon = lon[lonidx]
   # Get the topography data with the new indices
   topo = grid.variables['topo'][latidx.min():latidx.max(),
                                 lonidx.min():lonidx.max()]
   title = grid.title # remember so we can close the file before plotting
   grid.close()
 
   Lon,Lat = np.meshgrid(lon, lat)
   # make a pcolormesh (the fastest way to plot simple grids)
   mesh = plt.pcolormesh(Lon,Lat,topo)
   plt.plot(lines['lon'],lines['lat'],'k')
   # some customizations (called on the axes and figure)
   mesh.axes.set_title(title)
   mesh.axes.set_aspect(1/np.cos(np.pi*np.mean(BB['lat'])/180.)) # set aspect to match meters
   plt.xlim(BB['lon'][0],BB['lon'][1])
   plt.ylim(BB['lat'][0],BB['lat'][1])
   mesh.axes.set_xlabel('lon [deg]')
   mesh.axes.set_ylabel('lat [deg]')
   plt.clim(-50,150)
   mesh.figure.colorbar(mesh) # use the mesh to make a colorbar
   # Save the figure
   mesh.figure.savefig('%s.png' % title)
   mesh.figure.clf()

Re: Using Python and OPeNDAP to subset large gridded dataset

PostPosted: Fri Jul 01, 2016 11:27 pm
by mariakatosvich
Thanks & welcome yiboj, for modifying the code for making it more recheable..