
Add ability to deploy custom elastic-agents on different OS or runtimes #787

Open
marc-gr opened this issue Apr 11, 2022 · 37 comments
Labels
discuss Team:Ecosystem Label for the Packages Ecosystem team


@marc-gr
Contributor

marc-gr commented Apr 11, 2022

Some integrations might require elastic-agents with custom configurations for them or for their containers, e.g., winlogbeat requires a Windows container, auditbeat requires special container capabilities, etc.

I initially created #786, which adds the ability to deploy custom agents as test services; code specific to the Windows scenario is still missing.

I'm opening this thread to discuss other approaches that might avoid adding complexity to the test runner, if possible.

EDIT:

As mentioned in #787 (comment), there is now support in elastic-package to:

  • run tests in independent Elastic Agents (for now in Linux):
    • these independent Elastic Agents can be customized (with capabilities and scripts).
    • each system test writes into its own data stream
  • run system tests in parallel
    • the maximum number of routines to run in parallel can be set through an environment variable

What remains pending here is support for running Elastic Agents on other OSes (e.g., Windows) or in other runtimes (VMs?).

@jlind23 jlind23 added the Team:Ecosystem Label for the Packages Ecosystem team label Apr 11, 2022
@mtojek
Contributor

mtojek commented Apr 11, 2022

Regarding #786:

I recommend not modifying the test runner much, or at all. We'd like to rewrite that code eventually, as it has too many responsibilities and is error-prone now.

I see a few approaches that we can apply here.

Extend "profiles" with local patches

When a user or CI executes the elastic-package stack up command, it checks whether there are any local profile patches and creates a custom profile for that particular integration (if it doesn't exist yet).

Extend Compose stack definition with environment variables

Let's not add anything special except allowing variable customizations in the Compose stack. Whenever a user or CI executes the elastic-package stack up command, it will also load local variables.

Problem: it doesn't solve the problem of booting the elastic-package stack on a Windows machine. Should we move that to a separate issue? Actually, this could be paired with image overrides.

Hack: retag elastic-agent image

It's a hack, but it may work temporarily. Before running elastic-package stack up, we need to build the agent's Docker image for Windows and replace/retag the current one. This might be hard due to different stack versions and configuration changes.

@mtojek
Contributor

mtojek commented Apr 11, 2022

Pinging @jsoriano for his thoughts on this and #786.

@jsoriano
Member

Another option could be to move agent initialization to the system test runner and remove it from the default stack definition. Having it in the runner would give it full control of the started agents, allowing it to start them with different options and to handle platform-specific needs, as could be the case with Windows.
It could also allow starting different tests with different agent configurations, for example to test different auditbeat configurations with different capabilities.

We already have a custom agent for the Kubernetes service deployer; we could follow a similar strategy for any other deployer.
There could be general options, such as options to select the version to use, or things like capabilities to add. And there could be platform specific options, that would also allow things like selecting daemonset vs deployment in Kubernetes (#465). On a second iteration these options could be easily overridden with flags or environment variables.
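
To make the split between general and platform-specific options concrete, here is a small Go sketch. The `AgentOptions` type, its fields, the environment variable names (`AGENT_VERSION`, `AGENT_CAPABILITIES`, `AGENT_K8S_MODE`) and the default version string are all hypothetical placeholders, illustrating only how such options might be overridden by environment variables in a second iteration:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// AgentOptions is a hypothetical description of how a test agent should be
// started: general settings plus platform-specific ones.
type AgentOptions struct {
	Version      string   // general: Elastic Agent version to run
	Capabilities []string // general: extra Linux capabilities, e.g. AUDIT_CONTROL
	K8sMode      string   // Kubernetes-specific: "daemonset" or "deployment"
}

// applyEnvOverrides lets environment variables override the defaults.
// Unknown values for K8sMode are ignored so a typo cannot break the setup.
func applyEnvOverrides(o AgentOptions, getenv func(string) string) AgentOptions {
	if v := getenv("AGENT_VERSION"); v != "" {
		o.Version = v
	}
	if v := getenv("AGENT_CAPABILITIES"); v != "" {
		o.Capabilities = strings.Split(v, ",")
	}
	if v := getenv("AGENT_K8S_MODE"); v == "daemonset" || v == "deployment" {
		o.K8sMode = v
	}
	return o
}

func main() {
	// Placeholder defaults, not real elastic-package values.
	defaults := AgentOptions{Version: "8.13.0", K8sMode: "daemonset"}
	opts := applyEnvOverrides(defaults, os.Getenv)
	fmt.Printf("%+v\n", opts)
}
```

Passing `getenv` as a function keeps the override logic testable without touching the real process environment.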

If we remove the agent from the default stack definition, we could still have a stack subcommand to start agents for manual tests. Something like this could also cover #548.

I think this could be a more future-proof option, but it may require significant effort.


Another option, for the use case of starting an agent but no service, could be to add a new "system" deployer that just starts an agent with a given configuration, intended for system-level monitoring. This could help with packages for auditbeat or for the system module itself. It could be extended in the future to start completely different OSes using VMs.

This would be more in line with #786, but without needing to hack the current test runner and compose deployer.

@mtojek
Contributor

mtojek commented Apr 12, 2022

> Another option could be to move agent initialization to the system test runner and remove it from the default stack definition. Having it in the runner would give it full control of the started agents, allowing it to start them with different options and to handle platform-specific needs, as could be the case with Windows.
> It could also allow starting different tests with different agent configurations, for example to test different auditbeat configurations with different capabilities.

There are two constraints related to this approach:

  1. Don't forget that this is the mode we also use for development purposes. You can simply start the stack and have everything ready. It is really convenient.
  2. Agent enrollment with Fleet Server takes time. We considered this option at an early stage and decided to follow the "enroll once" approach at startup. It's also easier to debug if the agent instance is still present, not wiped out.

> We already have a custom agent for the Kubernetes service deployer; we could follow a similar strategy for any other deployer.
> There could be general options, such as options to select the version to use, or things like capabilities to add. And there could be platform specific options, that would also allow things like selecting daemonset vs deployment in Kubernetes (#465). On a second iteration these options could be easily overridden with flags or environment variables.

I like the approach of having a custom agent setup. You're right that we could apply similar logic as for kind to spawn a new agent. This way we don't need to modify the test runners at all. We will most likely need two setups: custom image properties and Windows.

> If we remove the agent from the default stack definition, we could still have a stack subcommand to start agents for manual tests. Something like this could also cover #548.

I had that in mind before, hence the issue, but always considered its complexity to be +Inf. Maybe we can evaluate it as a good first issue and "rebuild the stack command"?


To sum up, my vote would go to custom agent setup.

@jsoriano
Member

> We already have a custom agent for the Kubernetes service deployer; we could follow a similar strategy for any other deployer.
> There could be general options, such as options to select the version to use, or things like capabilities to add. And there could be platform specific options, that would also allow things like selecting daemonset vs deployment in Kubernetes (#465). On a second iteration these options could be easily overridden with flags or environment variables.

> I like the approach of having a custom agent setup. You're right that we could apply similar logic as for kind to spawn a new agent. This way we don't need to modify the test runners at all. We will most likely need two setups: custom image properties and Windows.

Could this be done without modifying runners?

@mtojek
Contributor

mtojek commented Apr 12, 2022

Yes, I think so, the same way we kept most of the changes for the Kubernetes service deployer contained in this file. There might be one inconvenience: the agent will be deployed during the first run of the system test.

@jsoriano
Member

Ah ok, but it would be modifying service deployers. Would you prefer to add an agent to the compose deployer, or to add a new deployer for these use cases?

@mtojek
Contributor

mtojek commented Apr 12, 2022

> Would you prefer to add an agent to the compose deployer

It looks like it depends on the final infrastructure setup. Not sure if that option will work for @marc-gr and Windows containers.

> add a new deployer for these use cases?

This option seems pluggable and flexible in terms of specific configuration properties or OS-specific logic. It also has an extra benefit: it will prevent copying custom agent code to multiple places.

I'm now wondering whether we are close to introducing a feature for using an agent under development. This way you could even use standalone builds. Maybe we should implement a proxy instead :)

@jsoriano
Member

jsoriano commented May 3, 2022

Discussed this offline with Marc; he is going to explore implementing something like #786, but as a new deployer, so the runner is not modified. This could cover the current auditbeat needs.

We also discussed that we probably need something like VMs for system tests; this will be necessary to support running tests on Windows, or even on Linux if not enough privileges can be granted to containers for some use cases.
We could run these tests on specialized CI workers, as in elastic/integrations#1713 for Windows, but it'd be nice to have something in elastic-package so developers working on Mac/Linux can also run these tests locally.

@cmacknz
Member

cmacknz commented Jan 30, 2024

> but it'd be nice to have something in elastic-package so developers working on Mac/Linux can also run these tests locally.

For Linux, https://multipass.run/ is a good cross-platform solution, as long as you are fine with only supporting Ubuntu VMs. For Windows there is no cross-platform equivalent; you have to provision cloud VMs.

This is generally what we do in the Elastic Agent test framework, https://github.com/elastic/elastic-agent/blob/main/docs/test-framework-dev-guide.md: you can test locally against multipass Ubuntu VMs; otherwise we provision Linux and Windows machines in the cloud. macOS VM support is TBD.

It would be good if we could align the provisioning here with the agent framework so we aren't maintaining this functionality twice. The only quirk with the agent test framework is that it uses https://github.com/adam-stokes/ogc for provisioning; we'd prefer to use Terraform, but we haven't implemented that yet. elastic/elastic-agent#2935
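
For the local Linux case, a test tool could shell out to the multipass CLI to create an Ubuntu VM for an agent. This Go sketch only builds the command (`multipass launch --name <name> <image>`) and does not run it; the VM name and the `22.04` image alias are example values, and `launchVMCmd` is a hypothetical helper, not part of elastic-package or the Elastic Agent test framework:

```go
package main

import (
	"fmt"
	"os/exec"
)

// launchVMCmd builds (but does not run) a multipass command that starts an
// Ubuntu VM for an agent test. Name and image alias are caller-provided.
func launchVMCmd(name, image string) *exec.Cmd {
	return exec.Command("multipass", "launch", "--name", name, image)
}

func main() {
	cmd := launchVMCmd("agent-test-1", "22.04")
	fmt.Println(cmd.Args)
	// To actually create the VM (requires multipass to be installed):
	// out, err := cmd.CombinedOutput()
}
```

Provisioning Windows would still need a different path (cloud VMs), as noted in the comment above.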

@cmacknz
Member

cmacknz commented Jan 30, 2024

CC @blakerouse

@mrodm
Contributor

mrodm commented Jun 3, 2024

When enabling independent Elastic Agents, some packages take around 3 hours to finish their tests (mainly system tests).

I added a new PR to allow creating a new Agent Policy for each executed test: #1866

This will allow us to:

  • Get one step closer to running system tests in parallel, since every test is going to use a different data stream to ingest docs.
  • Reduce complexity when using stages in system tests (e.g. --no-provision flag)

@mrodm
Contributor

mrodm commented Jun 10, 2024

Two new PRs have been created to change how test runners work in elastic-package:

These two PRs introduce two different interfaces to manage runners and tests:

  • Tester interface:
    • it handles the execution of just one test with its own lifecycle
      • Each testrunner can define its own specific tests, for instance:
        • asset tests: just one test for the whole package.
        • system tests: one test per configuration file and variant.
        • policy tests: one test per configuration file.
        • ...
  • TestRunner interface:
    • it handles the creation (and destruction) of global resources required for tests.
    • it handles the creation of Tester instances, one per test defined in the package.

@mrodm
Contributor

mrodm commented Jun 17, 2024

The next step is adding support for running system tests in parallel in elastic-package.

This work is being done in two different PRs:

This will allow us to run system tests in parallel in packages with a large number of system tests, like network_traffic or zeek.

@mrodm
Contributor

mrodm commented Jun 17, 2024

I ran some tests in this integrations PR with just 2 packages (network_traffic and zeek): elastic/integrations#10161

Comparing times among the different settings:

| Package | Sequential (stack Elastic Agent) | Sequential (independent Elastic Agents) | Parallel (up to 3) | Parallel (up to 5) | Parallel (up to 8) |
|---|---|---|---|---|---|
| network_traffic | 1h 20min | 2h 20min | 1h 7min | 42min | Error (timeouts) |
| zeek | 1h 10min | 2h 40min | 1h 8min | 42min | Error (timeouts) |
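
As a quick worked check on the table, the speedup of a parallel run is the sequential duration divided by the parallel duration. The snippet converts the table's times to minutes (the "up to 8" runs are left out because they timed out) and computes the ratios; the `run` type and `speedup` helper exist only for this illustration:

```go
package main

import "fmt"

// run holds durations in minutes, copied from the table above.
type run struct {
	pkg                string
	seqStack, seqIndep float64 // sequential: stack agent, independent agents
	upTo3, upTo5       float64 // parallel runs
}

// speedup is how many times faster the parallel run is than the baseline.
func speedup(baseline, parallel float64) float64 { return baseline / parallel }

func main() {
	runs := []run{
		{"network_traffic", 80, 140, 67, 42},
		{"zeek", 70, 160, 68, 42},
	}
	for _, r := range runs {
		fmt.Printf("%s: up to 5 is %.2fx faster than sequential independent agents, %.2fx faster than the stack agent\n",
			r.pkg, speedup(r.seqIndep, r.upTo5), speedup(r.seqStack, r.upTo5))
	}
}
```

With these numbers, running up to 5 tests in parallel is roughly 3.3x (network_traffic) and 3.8x (zeek) faster than sequential independent agents, and about 1.9x and 1.7x faster than the sequential stack-agent baseline.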

CI builds:

@mrodm
Contributor

mrodm commented Jun 17, 2024

I was wondering whether to close this issue once this PR (#1909) is merged.

All the support related to independent Elastic Agents and running system tests in parallel would be completed at that point.

What would still be missing:

  • releasing a new version of elastic-package and integrating it into the integrations repository.
  • enabling independent Elastic Agents (through an environment variable) and parallel system tests in the network_traffic and zeek packages.

For that, a follow-up issue could be created to enable those features in the integrations repository. Some packages will need updates while doing so, at least auditd_manager and oracle (see the related PoC PR with the required changes: elastic/integrations#9862).

Another issue could be created to run the system tests using the independent Elastic Agents by default. Could that be handled separately too?

However, that means developers would be running tests with the Elastic Agent from the stack, while CI would be using the new independent Elastic Agents. Developers who want to run independent Elastic Agents would need to set the environment variable: ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true

WDYT about closing this one (once the PR is merged) in favor of creating those new issues? @jsoriano @kpollich
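
A minimal Go sketch of how a tool might interpret ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT, assuming unset or unparsable values fall back to the stack's Elastic Agent. It uses `strconv.ParseBool`, so values like `true`, `1` or `TRUE` enable the feature; elastic-package's actual parsing may differ, and `independentAgentsEnabled` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const independentAgentEnv = "ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT"

// independentAgentsEnabled reports whether the variable holds a true value;
// unset or unparsable values fall back to the stack's Elastic Agent.
func independentAgentsEnabled(getenv func(string) string) bool {
	enabled, err := strconv.ParseBool(getenv(independentAgentEnv))
	return err == nil && enabled
}

func main() {
	if independentAgentsEnabled(os.Getenv) {
		fmt.Println("using independent Elastic Agents")
	} else {
		fmt.Println("using the Elastic Agent from the stack")
	}
}
```

From a shell, the variable can be set for a single run, e.g. `ELASTIC_PACKAGE_TEST_ENABLE_INDEPENDENT_AGENT=true elastic-package test system`.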

@mrodm
Contributor

mrodm commented Jun 17, 2024

Just to add to the previous comment: the docs should also be updated with these new settings.

I'll update the current PR with the required docs changes:
https://github.com/elastic/elastic-package/blob/main/docs/howto/system_testing.md#running-system-tests-with-independent-elastic-agents-in-each-test-technical-preview

EDIT: updated in 419f8ea

@jsoriano
Member

> I was wondering whether to close this issue once this PR (#1909) is merged.

Yep, I mostly agree with closing this once we can enable independent agents more generally. But please take into account that one of the original motivations for this issue was to be able to run winlog tests, and we are still unable to run Windows agents for this.
If we close this issue, please ensure that we keep some issue open for use cases on different operating systems.

@mrodm
Contributor

mrodm commented Jun 18, 2024

> But please take into account that one of the original motivations for this issue was to be able to run winlog tests, and we are still unable to run Windows agents for this.
> If we close this issue, please ensure that we keep some issue open for use cases on different operating systems.

That's true. I could keep this issue open (since there are other issues already linked to it) even after the PRs mentioned above are merged, until we can find time to work on adding support for running Elastic Agents on other OSes or runtimes.

@mrodm
Contributor

mrodm commented Jun 19, 2024

I created package-spec release 3.2.0 (elastic/package-spec#764), which includes the definition of the new configuration files to enable or disable parallel system tests.

@mrodm
Contributor

mrodm commented Jun 20, 2024

As a summary of what has been achieved so far: with the latest Pull Requests linked to this issue merged, there is now support in elastic-package to:

  • run tests in independent Elastic Agents (for now in Linux):
    • these independent Elastic Agents can be customized (with capabilities and scripts).
    • each system test writes into its own data stream
  • run system tests in parallel
    • the maximum number of routines to run in parallel can be set through an environment variable

As a follow-up, I created this issue to enable these features in the integrations repository:
elastic/integrations#10201

What remains pending here is support for running Elastic Agents on other OSes (e.g., Windows) or in other runtimes (VMs?).

Updated title and description accordingly.

cc @jsoriano @kpollich

@mrodm mrodm changed the title Add ability to deploy custom elastic-agents Add ability to deploy custom elastic-agents on different OS or runtimes Jun 20, 2024
@mrodm mrodm removed their assignment Jul 2, 2024
@kpollich
Member

kpollich commented Jul 2, 2024

Thanks for providing a summary of where we are today, @mrodm. I'm moving this into a quality sprint for now as we'll need to dedicate a large amount of time here if we prioritize adding cross-platform support to this new type of test.

@mrodm
Contributor

mrodm commented Sep 4, 2024

One use case where this feature could help is the system_audit package in the integrations repository (running Elastic Agent in different VMs).

This would allow running the system tests with Elastic Agents on Linux distributions other than Ubuntu, e.g. Fedora, to test that the agent can collect the required logs from the rpm package manager.
Related to elastic/integrations#11000
