Summary
I'm running dynamics simulations with a frozen model; the work consists of NEB calculations and plotting the resulting energy curves. Say my initial state is A, an intermediate state is B, and the final state is C. I performed NEB calculations for A-->B and A-->C on cluster M. Because my permissions on that cluster expired, I moved everything to cluster N and, to check the results, repeated the A-->B and A-->C NEB calculations there. The energy curves differ slightly, with a difference of about 0.2 eV at the transition state. Is this caused by the change of cluster? On both cluster M and cluster N, deepmd is v2.2.6-1-g174f204a.
In addition, MPI errors occur when I try to run the B-->C calculation on cluster N. I consulted the cluster administrator, and after testing he suggested the problem may be in DeePMD-kit itself. I don't know what is causing it or how to fix it, so I am opening this issue to ask for help. The error output is below, and an example is attached.
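To separate the effect of the two machines from the NEB run itself, one simple check is to evaluate the same frozen model on one and the same configuration (for example, a single NEB image) on both clusters and compare the total energies. Below is a minimal sketch using the DeepPot Python interface of deepmd-kit v2.x; the file names (graph.pb, image_coords.npy, image_cell.npy, image_types.raw) are placeholders, not files from this report.

```python
# Minimal cross-cluster consistency check (a sketch, not part of the original
# report): evaluate the same frozen model on the same configuration on cluster M
# and on cluster N, then compare the two total energies directly.
# "graph.pb" and the image_* files are placeholders for your own model and for
# the coordinates/cell/types of one NEB image.
import numpy as np
from deepmd.infer import DeepPot

dp = DeepPot("graph.pb")  # the frozen model used for the NEB run

# One frame: coordinates flattened to (nframes, natoms*3), the 3x3 cell
# flattened to (nframes, 9), and one integer type per atom.
coords = np.load("image_coords.npy").reshape(1, -1)
cell = np.load("image_cell.npy").reshape(1, 9)
atom_types = np.loadtxt("image_types.raw", dtype=int).tolist()

energy, force, virial = dp.eval(coords, cell, atom_types)
print("total energy (eV):", energy.ravel()[0])
```

If the single-point energies already differ noticeably between the two clusters, the discrepancy comes from the evaluation environment (library builds, hardware); if they agree closely, it more likely comes from the NEB run itself.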
DP-GEN Version
dpgen==0.12.1
Platform, Python Version, etc
python=3.12.4
dpdata==0.2.19
dpdispatcher==0.6.5
deepmd-kit v2.2.6
Details
Error (1):
WARNING: There was an error initializing an OpenFabrics device.
Local host: cn9
Local device: hfi1_0
MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[cn9:47344] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn9:47344] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn9:47344] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
Error (2):
2024-12-05 18:51:06.726189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.726267: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.725967: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
Error (3):
2024-12-05 15:11:25.911162: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 15:11:25.912789: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffd3910cfd0, scount=4, MPI_DOUBLE, rbuf=0x55dc221066c0, rcount=4, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(216): Failure during collective
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffcb0654ed0, scount=1, MPI_DOUBLE, rbuf=0x56528cc45a80, rcount=1, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(108):
MPIC_Sendrecv(340)..........................:
MPIDI_CH3U_Request_unpack_uebuf(516)........: Message truncated; 128 bytes received but buffer size is 32
MPIR_Allgather_intra_recursive_doubling(108):
MPIDI_CH3U_Receive_data_found(131)..........: Message from rank 2 and tag 7 truncated; 256 bytes received but buffer size is 128
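The failing call in Error (3) is an MPI_Allgather on MPI_DOUBLE buffers. As a diagnostic sketch (not part of the original job), a minimal mpi4py script that exercises the same collective with the same launcher and process count can help check whether the MPI installation on cluster N handles this collective at all, independently of LAMMPS and DeePMD-kit.

```python
# Minimal MPI_Allgather sanity check (diagnostic sketch, not from the original
# report). Run it with the same MPI launcher and process count as the NEB job,
# e.g.:  mpirun -np 16 python allgather_check.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes a small block of doubles, mirroring the
# MPI_Allgather(..., MPI_DOUBLE, ...) calls in the error stack above.
sendbuf = np.full(4, float(rank), dtype=np.float64)
recvbuf = np.empty(4 * size, dtype=np.float64)
comm.Allgather(sendbuf, recvbuf)

if rank == 0:
    print("Allgather completed on", size, "ranks:", recvbuf[::4])
```

If this minimal run already fails with a truncation error, the problem sits in the MPI environment itself rather than in DeePMD-kit; note that Error (1) is Open MPI output while the Error (3) stack looks like MPICH-style output, so it may be worth checking whether more than one MPI build is being mixed.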
I don't know the details of your simulation, but if you use any random method, there is no guarantee that the same random seed reproduces the same result on different machines.
OK. For the above three MPI errors that occur when calculating NEB via LAMMPS, do you have any suggestions on how to solve them?