Failed modelnet-prio-sched-test.sh and modelnet-test-dragonfly-synthetic.sh on Ubuntu #206

Open
shahimshaar opened this issue Aug 6, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@shahimshaar

I followed the installation instructions and everything built without problems, but 'make check' fails two tests: modelnet-prio-sched-test.sh and modelnet-test-dragonfly-synthetic.sh. Running 'cat ./test-suite.log' gives the following:

=================================
codes 1.2: ./test-suite.log

TOTAL: 22

PASS: 20

SKIP: 0

XFAIL: 0

FAIL: 2

XPASS: 0

ERROR: 0

.. contents:: :depth: 2

FAIL: tests/modelnet-test-dragonfly-synthetic.sh

credit_size not specified, using default: 8
no credit_delay specified - all credit delays set to 1.42
Within-node eager limit (node_eager_limit) not specified, setting to 16000
../codes/tests/modelnet-test-dragonfly-synthetic.sh: line 3: 17976 Killed src/network-workloads/model-net-synthetic --sync=1 --num_messages=1 -- $srcdir/src/network-workloads/conf/modelnet-synthetic-dragonfly.conf
FAIL tests/modelnet-test-dragonfly-synthetic.sh (exit status: 137)

FAIL: tests/modelnet-prio-sched-test.sh

Bandwidth of compute node channels not specified, setting to 20.000000
Within-node eager limit (node_eager_limit) not specified, setting to 16000
/home/shahm/codes-dev/build-codes/tests/.libs/modelnet-prio-sched-test --sync=1 -- tests/conf/modelnet-prio-sched-test.conf

Thu Aug 6 17:54:43 2020

ROSS Version: v7.2.0

tw_net_start: Found world size to be 1
NIC num injection port not specified, setting to 1
NIC seq delay not specified, setting to 10.000000
NIC num copy queues not specified, setting to 1
within node transfer per byte delay is 0.050000

ROSS Core Configuration:
Total PEs 1
Total KPs [Nodes (1) x KPs (16)] 16
Total LPs 4
Simulation End Time 31536000000000000.00
LP-to-PE Mapping model defined

ROSS Event Memory Allocation:
Model events 1025
Network events 16
Total events 1040

*** START SEQUENTIAL SIMULATION ***

Set num_servers per router 1, servers per injection queue per router 1, servers per node copy queue per node 1, num nics 1
*** END SIMULATION ***

: Running Time = 0.0002 seconds

TW Library Statistics:
Total Events Processed 511
Events Aborted (part of RBs) 0
Events Rolled Back 0
Event Ties Detected in PE Queues 0
Efficiency 100.00 %
Total Remote (shared mem) Events Processed 0
Percent Remote Events 0.00 %
Total Remote (network) Events Processed 0
Percent Remote Events 0.00 %

Total Roll Backs                                             0
Primary Roll Backs                                           0
Secondary Roll Backs                                         0
Fossil Collect Attempts                                      0
Total GVT Computations                                       0

Net Events Processed                                       511
Event Rate (events/sec)                              2823204.4
Total Events Scheduled Past End Time                         0

TW Memory Statistics:
Events Allocated 1041
Memory Allocated 618
Memory Wasted 454

TW Data Structure sizes in bytes (sizeof):
PE struct 624
KP struct 144
LP struct 136
LP Model struct 96
LP RNGs 80
Total LP 312
Event struct 152
Event struct with Model 552

TW Clock Cycle Statistics (MAX values in secs at 1.0000 GHz):
Initialization 0.7451
Priority Queue (enq/deq) 0.0000
AVL Tree (insert/delete) 0.0000
LZ4 (de)compression 0.0000
Buddy system 0.0000
Event Processing 0.0000
Event Cancel 0.0000
Event Abort 0.0000

GVT                                                     0.0000
Fossil Collect                                          0.0000
Primary Rollbacks                                       0.0000
Network Read                                            0.0000
Other Network                                           0.0000
Instrumentation (computation)                           0.0000
Instrumentation (write)                                 0.0000
Total Time (Note: Using Running Time above for Speedup)      0.0005

TW GVT Statistics: MPI AllReduce
GVT Interval 16
GVT Real Time Interval (cycles) 0
GVT Real Time Interval (sec) 0.00000000
Batch Size 16

Forced GVT                                                   0
Total GVT Computations                                       0
Total All Reduce Calls                                       0
Average Reduction / GVT                                   -nan

mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.

FAIL tests/modelnet-prio-sched-test.sh (exit status: 1)

Any help would be much appreciated!
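
For readers decoding the log above: the dragonfly test reports exit status 137, and by shell convention a status above 128 means the process was terminated by signal (status - 128). A minimal sketch of that arithmetic follows; the function name `decode_exit_status` is made up for illustration and is not part of CODES or ROSS.

```shell
#!/bin/sh
# decode_exit_status STATUS: explain a shell exit status.
# Convention: a status above 128 means "terminated by signal (STATUS - 128)".
# Hypothetical helper for illustration only; not part of CODES or ROSS.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the signal name for signal number N
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited normally with status $status"
    fi
}

decode_exit_status 137   # the dragonfly test
decode_exit_status 1     # the prio-sched test
```

For 137 this gives signal 9, which matches the "Killed" line in the log: the process was sent SIGKILL from outside rather than exiting on its own. The log alone does not say what sent it.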

@nmcglo
Member

nmcglo commented Aug 6, 2020

Interesting. Are you, by any chance, using Docker or some other container system?

- The odd status 137 failure in that specific dragonfly test has been seen before when someone was using containers (#198). Since it didn't seem to affect regular usage of CODES, it was put on the back burner at the time due to some tight deadlines on my end, followed by the rest of 2020's events!

- Some cursory googling about the mpirun-as-root warning suggests that this also happens with Docker containers and Open MPI. As also noted in #198, adding a user appuser to the container avoids running mpirun as root.
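
The second point can be sketched as a quick check of the current user before running the tests. This is only an illustration: the function name `mpirun_user_hint` is hypothetical, and the commented-out commands assume a Debian/Ubuntu-style container where `useradd` and `su` are available.

```shell
#!/bin/sh
# mpirun_user_hint UID: report whether Open MPI's run-as-root guard
# (the warning shown in the log) would trigger for the given numeric UID.
# Hypothetical helper for illustration only.
mpirun_user_hint() {
    uid=$1
    if [ "$uid" -eq 0 ]; then
        echo "uid 0: mpirun will refuse to start; switch to a non-root user"
    else
        echo "uid $uid: not root, mpirun's root check will not trigger"
    fi
}

mpirun_user_hint "$(id -u)"

# Inside a container, a non-root user could be created along these lines
# (assumption: run from the build directory, with files readable by appuser):
#   useradd --create-home appuser
#   su appuser -c 'make check'
```

The log also mentions `--allow-run-as-root` as an override, but Open MPI's own message advises against it, so a separate user is the more conservative route.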

I'll spend some time this weekend looking into a "building CODES with Docker" workflow. In the meantime, I'd suggest ignoring these failed tests. Let me know if other errors pop up during your usage of CODES.

@gonsie
Contributor

gonsie commented Oct 6, 2020

I’m trying to have ROSS CI testing run the CODES tests and just found a hang in modelnet-test-dragonfly-synthetic.sh. It’s most likely related, since the Travis CI tests run in a container. Is there a way to skip certain tests with the make check command?
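
One possible approach, assuming the CODES build uses Automake's standard test harness (the test-suite.log format in the original report suggests it does): Automake lets the `TESTS` variable be overridden on the `make check` command line, so a filtered list can be passed in. The sketch below only builds and prints such a command. The helper `filter_tests` and the third test name are hypothetical, and whether the override works depends on how the project's Makefile defines `TESTS`.

```shell
#!/bin/sh
# filter_tests ALL SKIP: print the entries of ALL that are not listed in SKIP.
# Both arguments are space-separated lists. Hypothetical helper for illustration.
filter_tests() {
    all=$1
    skip=$2
    kept=""
    for t in $all; do
        case " $skip " in
            *" $t "*) ;;                      # listed in SKIP: drop it
            *) kept="${kept:+$kept }$t" ;;    # otherwise keep it
        esac
    done
    echo "$kept"
}

# Placeholder list: the first two names come from this thread, the third is
# made up. The real list of 22 tests lives in the project's Makefile.
ALL_TESTS="tests/modelnet-test-dragonfly-synthetic.sh tests/modelnet-prio-sched-test.sh tests/example-other-test.sh"
SKIP_TESTS="tests/modelnet-test-dragonfly-synthetic.sh"

# Print the command instead of running it.
echo "make check TESTS='$(filter_tests "$ALL_TESTS" "$SKIP_TESTS")'"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; in CI the printed line would be run from the build directory.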
