Checksum Information

Post by mgangl » Wed Aug 04, 2021 12:23 pm

Question: I'm used to downloading the *.md5 file along with my data to ensure I've got the right data. How will I do this in the cloud?

Checksums are alive and well in the cloud! There are two ways to access checksum information. The first is to download the checksum file (.md5, or whichever checksum the dataset uses); the second is to read the checksum from the search metadata itself. The metadata will always include the checksum information, whereas checksum "sidecar" files are available for download only for some datasets.


Take a given dataset:

Dataset Shortname: MODIS_A-JPL-L2P-v2019.0
Concept-ID: C1940473819-POCLOUD
Earthdata Search: Link

Each given granule has a CMR entry that includes some useful information:

https://cmr.earthdata.nasa.gov/search/c ... 41-POCLOUD

Code:
  "RelatedUrls": [
    {
      "URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MODIS_A-JPL-L2P-v2019.0/20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc.md5",
      "Description": "Download 20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc.md5",
      "Type": "EXTENDED METADATA"
    },
    {
      "URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc",
      "Description": "Download 20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc",
      "Type": "GET DATA"
    },
    {
      "URL": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
      "Description": "api endpoint to retrieve temporary credentials valid for same-region direct s3 access",
      "Type": "VIEW RELATED INFORMATION"
    },
    {
      "URL": "https://opendap.earthdata.nasa.gov/collections/C1940473819-POCLOUD/granules/20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0",
      "Type": "USE SERVICE API",
      "Subtype": "OPENDAP DATA",
      "Description": "OPeNDAP request URL"
    }
  ]



We can see there are usually two files of interest here: the data file and the checksum file. The download URLs for these are:

https://archive.podaac.earthdata.nasa.g ... -fv01.0.nc

https://archive.podaac.earthdata.nasa.g ... 1.0.nc.md5

So in a pinch, you could always try downloading the ".md5" file. Notice, though, that the data file uses the "podaac-ops-cumulus-protected" path in the URL while the md5 file uses the "podaac-ops-cumulus-public" path.
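
If you want to pick those two URLs out of the "RelatedUrls" list in code, here is a minimal Python sketch. The function name `split_related_urls` is made up for illustration, and it assumes the data file is the entry with "Type": "GET DATA" and the sidecar is whichever URL ends in ".md5", as in the listing above. Other collections may label things differently, so treat it as a starting point.

```python
def split_related_urls(related_urls):
    """Return (data_url, checksum_url) from a granule's RelatedUrls list.

    Either element is None when no matching entry is found.
    """
    data_url = None
    checksum_url = None
    for entry in related_urls:
        url = entry.get("URL", "")
        if url.endswith(".md5"):
            # Assumption: the sidecar checksum file is identified by its suffix.
            checksum_url = url
        elif entry.get("Type") == "GET DATA":
            # Assumption: the data file is the "GET DATA" entry.
            data_url = url
    return data_url, checksum_url


# Trimmed copy of the RelatedUrls entries shown above.
BASE = "https://archive.podaac.earthdata.nasa.gov"
NAME = "20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc"
related_urls = [
    {"URL": f"{BASE}/podaac-ops-cumulus-public/MODIS_A-JPL-L2P-v2019.0/{NAME}.md5",
     "Type": "EXTENDED METADATA"},
    {"URL": f"{BASE}/podaac-ops-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/{NAME}",
     "Type": "GET DATA"},
    {"URL": f"{BASE}/s3credentials", "Type": "VIEW RELATED INFORMATION"},
]

data_url, checksum_url = split_related_urls(related_urls)
print(data_url)
print(checksum_url)
```

For a collection without sidecar files, `checksum_url` simply comes back as `None`.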

Alternatively, you can get the checksum from the search metadata. Let's take another look at our original search result:

https://cmr.earthdata.nasa.gov/search/c ... 41-POCLOUD

Code:
 "DataGranule": {
    "ArchiveAndDistributionInformation": [
      {
        "SizeUnit": "MB",
        "Size": 9.34600830078125e-05,
        "Checksum": {
          "Value": "931ab5f4cee34d8d1e03ffda2d1076e2",
          "Algorithm": "MD5"
        },
        "SizeInBytes": 98,
        "Name": "20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc.md5"
      },
      {
        "SizeUnit": "MB",
        "Size": 21.401312828063965,
        "Checksum": {
          "Value": "caced2a0a12d143d2fa258b7926db265",
          "Algorithm": "MD5"
        },
        "SizeInBytes": 22440903,
        "Name": "20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc"
      }
    ],
    "DayNightFlag": "Unspecified",
    "ProductionDateTime": "2021-08-04T20:03:27.000Z"
  },



We can see that the metadata for the downloadable data file includes the checksum of the data file:

Code:
{
        "SizeUnit": "MB",
        "Size": 21.401312828063965,
        "Checksum": {
          "Value": "caced2a0a12d143d2fa258b7926db265",
          "Algorithm": "MD5"
        },
        "SizeInBytes": 22440903,
        "Name": "20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc"
      }


In this case, the checksum is `caced2a0a12d143d2fa258b7926db265`. If we download the matching checksum file and look at it, we see that same string:

Code:
(pyfuse) MT-212570:cmr-drive gangl$ wget https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MODIS_A-JPL-L2P-v2019.0/20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc.md5
...
(pyfuse) MT-212570:cmr-drive gangl$ cat 20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc.md5
caced2a0a12d143d2fa258b7926db265  20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc
(pyfuse) MT-212570:cmr-drive gangl$



You can use this checksum in your processing: compute the MD5 of the downloaded data file and confirm it matches, which shows the file arrived complete and uncorrupted.
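
As a rough illustration of that check, the Python sketch below parses a sidecar line in the "checksum, whitespace, filename" layout shown in the `cat` output above and compares it with an MD5 computed locally. The helper names (`parse_md5_sidecar`, `md5_of_file`, `matches_sidecar`) are hypothetical, and the demo hashes a small stand-in file rather than the real granule.

```python
import hashlib
import os
import tempfile


def parse_md5_sidecar(text):
    """Split the first line of a sidecar file into (checksum, filename)."""
    checksum, filename = text.strip().splitlines()[0].split(None, 1)
    # md5sum marks binary mode with a leading "*" on the filename; drop it.
    return checksum.lower(), filename.strip().lstrip("*")


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(data_path, sidecar_text):
    """True when the file's MD5 equals the checksum in the sidecar text."""
    expected, _ = parse_md5_sidecar(sidecar_text)
    return md5_of_file(data_path) == expected


# The sidecar line from the cat output above.
sidecar_line = (
    "caced2a0a12d143d2fa258b7926db265  "
    "20210804165501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc\n"
)
expected, filename = parse_md5_sidecar(sidecar_line)

# Demo on a stand-in file (NOT the real granule).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.nc")
    with open(demo_path, "wb") as handle:
        handle.write(b"stand-in bytes, not a real granule")
    demo_sidecar = f"{md5_of_file(demo_path)}  demo.nc\n"
    demo_ok = matches_sidecar(demo_path, demo_sidecar)
    demo_bad = matches_sidecar(demo_path, sidecar_line)
```

Reading in chunks keeps memory use flat, which matters once granules get into the hundreds of megabytes.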

For an example of a collection that does not have downloadable ".md5" files, take a look at the following granule:

https://cmr.earthdata.nasa.gov/search/c ... 38-POCLOUD

The only place to get the checksum information for this file is in the metadata:

Code:
{
  "SizeUnit": "MB",
  "Size": 45.76008319854736,
  "Checksum": {
    "Value": "7b67b73819a2a59f0194d2ad35b84ac5",
    "Algorithm": "MD5"
  },
  "SizeInBytes": 47982925,
  "Name": "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03.nc"
},


To future-proof your code and processes, rely on the checksums from the search metadata. The following examples use the "jq" command-line tool, which is incredibly handy for parsing and querying JSON.

Code:
curl "https://cmr.earthdata.nasa.gov/search/concepts/G2099210338-POCLOUD" | jq .DataGranule.ArchiveAndDistributionInformation

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5694  100  5694    0     0  29502      0 --:--:-- --:--:-- --:--:-- 29502
[
  {
    "SizeUnit": "MB",
    "Size": 0.30828857421875,
    "Checksum": {
      "Value": "ac66e43bf077c473f51e5f2de229e3d4",
      "Algorithm": "MD5"
    },
    "SizeInBytes": 323264,
    "Name": "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03.xfdumanifest.xml"
  },
  {
    "SizeUnit": "MB",
    "Size": 45.76008319854736,
    "Checksum": {
      "Value": "7b67b73819a2a59f0194d2ad35b84ac5",
      "Algorithm": "MD5"
    },
    "SizeInBytes": 47982925,
    "Name": "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03.nc"
  },
  {
    "SizeUnit": "MB",
    "Size": 4.683554649353027,
    "Checksum": {
      "Value": "bc579b53d4c30980bd40316ae08035cb",
      "Algorithm": "MD5"
    },
    "SizeInBytes": 4911063,
    "Name": "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03.bufr.bin"
  }
]



But there are a bunch of files in the above example, and I only want the netCDF file! Well, we can drill down to that on the command line as well:

Code:
curl "https://cmr.earthdata.nasa.gov/search/concepts/G2099210338-POCLOUD" | jq '.DataGranule.ArchiveAndDistributionInformation[] | select(.Name|endswith(".nc"))'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5694  100  5694    0     0  16222      0 --:--:-- --:--:-- --:--:-- 16222
{
  "SizeUnit": "MB",
  "Size": 45.76008319854736,
  "Checksum": {
    "Value": "7b67b73819a2a59f0194d2ad35b84ac5",
    "Algorithm": "MD5"
  },
  "SizeInBytes": 47982925,
  "Name": "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03.nc"
}
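
If your processing is in Python rather than on the command line, the same selection and a follow-up check could look roughly like the sketch below. The function names are made up for illustration. The mapping from the metadata's "Algorithm" label to a `hashlib` name (lowercase, hyphens dropped) is an assumption that works for labels such as "MD5" or "SHA-256" but not necessarily for every algorithm a dataset might use. The verification demo runs on a small stand-in file, not the real granule.

```python
import hashlib
import os
import tempfile


def find_entry(entries, suffix):
    """Return the single entry whose Name ends with suffix (like the jq select)."""
    matches = [entry for entry in entries if entry["Name"].endswith(suffix)]
    if len(matches) != 1:
        raise ValueError(f"expected one file ending in {suffix!r}, found {len(matches)}")
    return matches[0]


def verify_file(path, entry, chunk_size=1024 * 1024):
    """Check a local file against an entry's SizeInBytes and Checksum."""
    if os.path.getsize(path) != entry["SizeInBytes"]:
        return False
    # Assumption: "MD5" -> "md5", "SHA-256" -> "sha256"; hashlib.new raises
    # ValueError for names it does not support.
    algorithm = entry["Checksum"]["Algorithm"].lower().replace("-", "")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == entry["Checksum"]["Value"].lower()


# Trimmed copy of the three entries returned by the curl | jq example above.
STEM = "S6A_P4_2__LR_STD__NR_027_057_20210804T164357_20210804T183529_F03"
entries = [
    {"Name": STEM + ".xfdumanifest.xml", "SizeInBytes": 323264,
     "Checksum": {"Value": "ac66e43bf077c473f51e5f2de229e3d4", "Algorithm": "MD5"}},
    {"Name": STEM + ".nc", "SizeInBytes": 47982925,
     "Checksum": {"Value": "7b67b73819a2a59f0194d2ad35b84ac5", "Algorithm": "MD5"}},
    {"Name": STEM + ".bufr.bin", "SizeInBytes": 4911063,
     "Checksum": {"Value": "bc579b53d4c30980bd40316ae08035cb", "Algorithm": "MD5"}},
]
netcdf = find_entry(entries, ".nc")
print(netcdf["Checksum"]["Value"])

# Demo on stand-in bytes with a metadata entry built to match them.
payload = b"stand-in bytes, not a real granule"
demo_entry = {
    "Name": "demo.nc",
    "SizeInBytes": len(payload),
    "Checksum": {"Value": hashlib.md5(payload).hexdigest(), "Algorithm": "MD5"},
}
tampered_entry = dict(demo_entry, Checksum={"Value": "0" * 32, "Algorithm": "MD5"})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.nc")
    with open(demo_path, "wb") as handle:
        handle.write(payload)
    demo_ok = verify_file(demo_path, demo_entry)
    demo_tampered = verify_file(demo_path, tampered_entry)
```

Checking the size first is a cheap way to catch truncated downloads before hashing the whole file.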



If you have any questions, don't hesitate to ask in this forum!