Intake-STAC with NASA CMR STAC proxy: Authentication #60

Open
scottyhq opened this issue Aug 22, 2020 · 14 comments
Labels
documentation enhancement New feature or request

@scottyhq
Collaborator

As part of STAC sprint 6 I was trying out intake-stac with https://github.com/nasa/cmr-stac. It would be absolutely amazing to integrate intake-stac with that endpoint to facilitate working with NASA datasets! But there are multiple things to work out. First and foremost is how to deal with authentication.

Unlike boto3 cloud credentials, NASA uses an 'Earthdata login' (https://urs.earthdata.nasa.gov/documentation). Typically, science users keep their username and password in a ~/.netrc file, which is read whenever they retrieve a file. This mechanism doesn't currently work with the intake-stac .to_dask() method. For example:

item['data'].metadata
#{'href': 'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'}
da = item['data'].to_dask()

Leads to a big traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    197             try:
--> 198                 file = self._cache[self._key]
    199             except KeyError:

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-15-90d7a2a112b8> in <module>
----> 1 da = item['data'].to_dask()

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     56             _open_dataset = xr.open_dataset
     57 
---> 58         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
     59 
     60     def _add_path_to_ds(self, ds):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    507         if engine == "netcdf4":
    508             store = backends.NetCDF4DataStore.open(
--> 509                 filename_or_obj, group=group, lock=lock, **backend_kwargs
    510             )
    511         elif engine == "scipy":

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    356             netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    357         )
--> 358         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
    359 
    360     def _acquire(self, needs_lock=True):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in __init__(self, manager, group, mode, lock, autoclose)
    312         self._group = group
    313         self._mode = mode
--> 314         self.format = self.ds.data_model
    315         self._filename = self.ds.filepath()
    316         self.is_remote = is_remote_uri(self._filename)

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in ds(self)
    365     @property
    366     def ds(self):
--> 367         return self._acquire()
    368 
    369     def open_store_variable(self, name, var):

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/netCDF4_.py in _acquire(self, needs_lock)
    359 
    360     def _acquire(self, needs_lock=True):
--> 361         with self._manager.acquire_context(needs_lock) as root:
    362             ds = _nc4_require_group(root, self._group, self._mode)
    363         return ds

~/miniconda3/envs/intake-stac-gui/lib/python3.7/contextlib.py in __enter__(self)
    110         del self.args, self.kwds, self.func
    111         try:
--> 112             return next(self.gen)
    113         except StopIteration:
    114             raise RuntimeError("generator didn't yield") from None

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    184     def acquire_context(self, needs_lock=True):
    185         """Context manager for acquiring a file."""
--> 186         file, cached = self._acquire_with_cache_info(needs_lock)
    187         try:
    188             yield file

~/miniconda3/envs/intake-stac-gui/lib/python3.7/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    202                     kwargs = kwargs.copy()
    203                     kwargs["mode"] = self._mode
--> 204                 file = self._opener(*self._args, **kwargs)
    205                 if self._mode == "w":
    206                     # ensure file doesn't get overriden when opened again

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -78] NetCDF: Authorization failure: b'https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'

Full example here: https://gist.github.com/scottyhq/04fe1e2d0b946b97228f6922cf001bbd

@scottyhq
Collaborator Author

scottyhq commented Aug 22, 2020

Since there will be lots of valuable data like this that is not in a cloud-optimized data store and format, I think it makes sense to have a download() method that can pick up ~/.netrc (equivalent to wget https://grfn.asf.alaska.edu/door/download/S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc). For this particular example, it is then up to a user to load into xarray from the local file:

import xarray as xr
localFile = 'S1-GUNW-A-R-087-tops-20141023_20141011-153856-27545N_25464N-PP-1a1a-v2_0_2.nc'
da = xr.open_dataset(localFile,
                     group='/science/grids/data')
da

thoughts @matthewhanson @jhamman @apawloski @martindurant ?

@martindurant
Member

What is contained in the .netrc file, is it user/password for the HTTP call?

In general, you can use fsspec.open_local and a URL containing caching (or a local path), and get an experience on par with other fsspec operations. Parallel downloading of multiple files should not be far off either.

@scottyhq
Collaborator Author

scottyhq commented Aug 26, 2020

What is contained in the .netrc file, is it user/password for the HTTP call?

cat ~/.netrc looks like this:

machine urs.earthdata.nasa.gov login MYUSERNAME  password MYPASSWORD

It looks like the requests library automatically picks up this file (the code block below works). There is even a standard library module for reading it (https://docs.python.org/3/library/netrc.html)! But I'm unsure how to get fsspec to read it or pass the username and password to HTTPFileSystem.

import requests

url = item['data'].urlpath
with open('test.nc', 'wb') as f:
    # requests reads ~/.netrc on its own, so no explicit auth is passed
    resp = requests.get(url)
    f.write(resp.content)
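
For reference, the standard library netrc module mentioned above can be exercised on its own. This is a minimal sketch, assuming a throwaway file with the same layout as the ~/.netrc line shown earlier; the temporary path and the credentials are placeholders, not real ones:

```python
import netrc
import os
import tempfile

# Write a placeholder netrc file; a real one normally lives at ~/.netrc.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as tmp:
    tmp.write("machine urs.earthdata.nasa.gov login MYUSERNAME password MYPASSWORD\n")
    netrc_path = tmp.name

try:
    # authenticators() returns a (login, account, password) tuple, or None
    # when the host has no entry and the file has no default entry.
    login, account, password = netrc.netrc(netrc_path).authenticators(
        "urs.earthdata.nasa.gov"
    )
finally:
    os.remove(netrc_path)

print(login, password)
```

The account field is unused in this layout, so only login and password matter for HTTP Basic auth.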

@martindurant
Member

martindurant commented Aug 26, 2020

fsspec uses aiohttp, not requests, so maybe that's why it's not getting picked up automatically. In this case, it should work like

import netrc
import fsspec

(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
of = fsspec.open(url, "rb", auth=(username, password))
with of as f:
    ...

or

fs = fsspec.filesystem("http")  # can include auth here for all URLs, or specify with open
f = fs.open(url, "rb", auth=(username, password))

Actually, after a little reading, it seems that aiohttp does support this, if the client is passed trust_env=True (see aio-libs/aiohttp#2584 ), but there is no way to get this arg to the client in fsspec right now. It would be easy to add (client_kwargs=None, for example, as done for s3fs), if someone is willing.

@scottyhq
Collaborator Author

scottyhq commented Sep 1, 2020

related PR over in sat-stac sat-utils/sat-stac#62

@scottyhq
Collaborator Author

scottyhq commented Sep 4, 2020

Hi @martindurant - after trying a few other approaches to see how this works behind the scenes I'm a bit confused.

The following code works using aiohttp directly:

import aiohttp

url = item['data'].urlpath
auth=aiohttp.BasicAuth(username,password)

async with aiohttp.ClientSession(auth=auth) as session:
    async with session.get(url) as resp:
        print(resp.status)
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())

I can't seem to get the ~/.netrc picked up. Reading the PR you linked to and the docs, maybe there is a separate workflow dealing with proxies that this gets into, because the following returns 401 Unauthorized Basic realm="Please enter your Earthdata Login credentials":

async with aiohttp.ClientSession(trust_env=True) as session:
    async with session.get(url) as resp:
        print(resp.text)
        with open('local.nc', 'wb') as f:
            f.write(await resp.read())

If I use fsspec as you suggested with the following, I get a FileNotFoundError:

(username, account, password) = netrc.netrc().authenticators("urs.earthdata.nasa.gov")
auth=(username,password)
of = fsspec.open(url, "rb", auth=(username, password))
with of as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())

Finally, I thought this might work, but I get a ClientResponseError

fs = fsspec.filesystem("http", auth=aiohttp.BasicAuth(username,password))
with fs.open(url, "rb") as remote:
    with open('local.nc', 'wb') as local:
        local.write(remote.read())   

Interestingly, for the last case the traceback provides a link that works when I click on it in my browser: the download starts!?

ClientResponseError: 401, message='Unauthorized', url=URL('https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=iwntGSgHy9yoog7Mjag0dQ&response_type=code&redirect_uri=https://grfn.asf.alaska.edu/door/oauth&state=aHR0cDovL2dyZm4uYXNmLmFsYXNrYS5lZHUvZG9vci9kb3dubG9hZC9TMS1HVU5XLUEtUi0wODctdG9wcy0yMDE0MTAyM18yMDE0MTAxMS0xNTM4NTYtMjc1NDVOXzI1NDY0Ti1QUC0xYTFhLXYyXzBfMi5uYw')
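
The URL in that error can be taken apart to see what the login server was asked to do. This is a stdlib-only sketch, not part of the original report. It assumes the state parameter is URL-safe base64 without padding, which appears to be the case here because it decodes to the originally requested file URL:

```python
import base64
from urllib.parse import parse_qs, urlsplit

# The URL reported in the ClientResponseError above.
redirect_url = (
    "https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401"
    "&client_id=iwntGSgHy9yoog7Mjag0dQ&response_type=code"
    "&redirect_uri=https://grfn.asf.alaska.edu/door/oauth"
    "&state=aHR0cDovL2dyZm4uYXNmLmFsYXNrYS5lZHUvZG9vci9kb3dubG9hZC9T"
    "MS1HVU5XLUEtUi0wODctdG9wcy0yMDE0MTAyM18yMDE0MTAxMS0xNTM4NTYtMjc1"
    "NDVOXzI1NDY0Ti1QUC0xYTFhLXYyXzBfMi5uYw"
)

# parse_qs maps each query parameter to a list of values.
params = parse_qs(urlsplit(redirect_url).query)

# Restore the stripped base64 padding before decoding.
state = params["state"][0]
padded = state + "=" * (-len(state) % 4)
original_url = base64.urlsafe_b64decode(padded).decode("utf-8")

print(params["redirect_uri"][0])
print(original_url)
```

So the 401 comes from the Earthdata login host after a redirect, not from the data host that was originally requested.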

Could you please advise on how to use fsspec directly? And where would be best to implement the reading of credentials from ~/.netrc (intake, fsspec, aiohttp, intake-stac?) so that a user doesn't have to write code to load them?

@martindurant
Member

And where would be best to implement the reading of credentials

The HttpFileSystem ought to have an option, so that you can pass the trust_env parameter - although it seems maybe that isn't working for you. I've never heard of .netrc before, but it doesn't sound stac-specific. If we can't get aiohttp to find and use it automatically, then fsspec would be the place to handle it.

Is there any chance you can share some creds privately so that I can test what works?

@scottyhq
Collaborator Author

scottyhq commented Sep 4, 2020

Thanks for your help @martindurant!

there are definitely two things to figure out: 1) how to correctly pass username and password explicitly to httpfilesystem (the last code block seems close!) and 2) getting the netrc read correctly behind the scenes.

I can send you creds via keybase or however you prefer, it's also easy to register (https://urs.earthdata.nasa.gov/home) this is NASA's standard login which anyone can sign up for w/ some basic info.

@martindurant
Member

OK, I can sign up - but I won't get to this until next week now.

@martindurant
Member

It turns out, if you manually follow the redirect - i.e., apply the auth again to the generated URL - you can get the file. I feel like I'm getting somewhere.
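
One way to picture "applying the auth again": HTTP Basic credentials travel in an Authorization header, and that header has to be present on the request to the Earthdata login URL, not only on the request to the data URL. The sketch below uses only the standard library and builds request objects without sending anything. with_basic_auth is a hypothetical helper, and the URLs and credentials are placeholders. How aiohttp or fsspec treat the header across redirects is not shown here.

```python
import base64
from urllib.request import Request


def with_basic_auth(url, username, password):
    """Return a Request for `url` carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder URLs: the data host, and the login host it redirects to.
data_url = "https://grfn.asf.alaska.edu/door/download/example.nc"
login_url = "https://urs.earthdata.nasa.gov/oauth/authorize"

first = with_basic_auth(data_url, "MYUSERNAME", "MYPASSWORD")
# "Manually following the redirect" means building the second request
# yourself, with the same credentials attached again.
second = with_basic_auth(login_url, "MYUSERNAME", "MYPASSWORD")

print(second.get_header("Authorization"))
```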

@martindurant
Member

With fsspec/filesystem_spec#400 , you can do

fs = fsspec.filesystem('http', client_kwargs={'auth': aiohttp.BasicAuth('mdurant', 'xx')})
with fs.open(url) as f:
    f.read()

I don't know why passing in the open kwargs or putting in .netrc isn't working, even with trust_env=True

@scottyhq
Collaborator Author

scottyhq commented Sep 9, 2020

Thanks @martindurant !

I don't know why passing in the open kwargs or putting in .netrc isn't working, even with trust_env=True

There definitely is something odd with how aiohttp handles the netrc auth. Short of opening an issue upstream, I'm wondering if in fsspec we could have an option that generates the aiohttp.BasicAuth from a netrc. For example fs = fsspec.filesystem('http', netrc_auth="urs.earthdata.nasa.gov")

I'm still unclear about how to get this into intake-stac as well. It seems like some sort of auth arguments need to be accepted by class AbstractStacCatalog(Catalog), which then get passed down the chain. For example:

from intake import open_stac_catalog
catalog_url = 'https://raw.githubusercontent.com/cholmes/sample-stac/master/stac/catalog.json'
cat = open_stac_catalog(catalog_url,  netrc_auth="urs.earthdata.nasa.gov")

Such that whenever a user opens a file, the auth settings are in place:

item = cat['myitem']
da = item['data'].to_dask()

@martindurant
Member

martindurant commented Sep 9, 2020

Seems like it needs to migrate to this line, where we know the URL, and can do the login lookup. That should be the default, but probably the user should be able to override.
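
A lookup keyed on the URL could look roughly like the sketch below. netrc_auth_for_url is a hypothetical helper written for illustration; neither fsspec nor intake-stac is claimed to expose it, and the netrc file here is a temporary placeholder:

```python
import netrc
import os
import tempfile
from urllib.parse import urlsplit


def netrc_auth_for_url(url, netrc_file=None):
    """Return (login, password) for the host of `url`, or None if absent.

    With netrc_file=None the default ~/.netrc is used.
    """
    host = urlsplit(url).hostname
    if host is None:
        return None
    try:
        entry = netrc.netrc(netrc_file).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("machine urs.earthdata.nasa.gov login MYUSERNAME password MYPASSWORD\n")

try:
    # The login host has an entry; the data host does not.
    login_hit = netrc_auth_for_url(
        "https://urs.earthdata.nasa.gov/oauth/authorize", tmp.name
    )
    data_hit = netrc_auth_for_url(
        "https://grfn.asf.alaska.edu/door/download/example.nc", tmp.name
    )
finally:
    os.remove(tmp.name)

print(login_hit, data_hit)
```

With the ~/.netrc layout shown earlier, a lookup keyed on the data URL finds nothing, because the entry is for the login host. That suggests the lookup needs to happen at the point where the final URL is known, with a user override for other cases.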

weiji14 added a commit to weiji14/deepicedrain that referenced this issue Sep 17, 2020
The newer fsspec 0.8.0 uses aiohttp for http requests, and that breaks the netrc authentication to the Earthdata site. Using fsspec 0.8.2 helps a bit, but still throws an error like "ClientResponseError: 401, message='Unauthorized', url=URL('https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401...". Need to figure out how to inject the credentials into the intake_xarray/intake/fsspec/aiohttp stack somehow, but temporarily downgrading for now. See also relevant discussion on intake/intake-stac#60.
@scottyhq scottyhq added the enhancement New feature or request label Oct 16, 2020
@carygeo

carygeo commented Nov 14, 2020

The .netrc authentication steps in Remote NetCDF + Authentication from this example worked for me:
https://github.com/intake/intake-stac/blob/master/examples/intake-cmr-stac.ipynb

@scottyhq scottyhq added this to the v0.4.0 milestone Jun 21, 2021