在tensorflow分布式训练过程中突然终止(终止)

news2024/11/15 15:27:25

问题

这是为那些将从服务器接收渐变的员工提供的培训功能,在计算权重和偏差后,将更新的渐变发送到服务器。代码如下:

def train():
  """Train CIFAR-10 for a number of steps."""

  g1 = tf.Graph()
  with g1.as_default():
    global_step = tf.contrib.framework.get_or_create_global_step()

    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()

    # Build a Graph that computes the logits predictions from the
    # inference model.
    logits = cifar10.inference(images)

    # Calculate loss.
    loss = cifar10.loss(logits, labels)
    grads  = cifar10.train_part1(loss, global_step)

    only_gradients = [g for g,_ in grads]
    only_vars = [v for _,v in grads]


    placeholder_gradients = []

    #with tf.device("/gpu:0"):
    for grad_var in grads :
        placeholder_gradients.append((tf.placeholder('float', shape=grad_var[0].get_shape()) ,grad_var[1]))



    feed_dict = {}

    for i,grad_var in enumerate(grads): 
       feed_dict[placeholder_gradients[i][0]] = np.zeros(placeholder_gradients[i][0].shape)



    # Build a Graph that trains the model with one batch of examples and
    # updates the model parameters.
    train_op = cifar10.train_part2(global_step,placeholder_gradients)

    class _LoggerHook(tf.train.SessionRunHook):
      """Logs loss and runtime."""

      def begin(self):
        self._step = -1
        self._start_time = time.time()

      def before_run(self, run_context):
        self._step += 1
        return tf.train.SessionRunArgs(loss)  # Asks for loss value.


      def after_run(self, run_context, run_values):
        if self._step % FLAGS.log_frequency == 0:
          current_time = time.time()
          duration = current_time - self._start_time
          self._start_time = current_time

          loss_value = run_values.results
          examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
          sec_per_batch = float(duration / FLAGS.log_frequency)

          format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                        'sec/batch)')
          print (format_str % (datetime.now(), self._step, loss_value,
                               examples_per_sec, sec_per_batch))

    with tf.train.MonitoredTrainingSession(
        checkpoint_dir=FLAGS.train_dir,
        hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
               tf.train.NanTensorHook(loss),
               _LoggerHook()],
        config=tf.ConfigProto(
            log_device_placement=FLAGS.log_device_placement, gpu_options=gpu_options)) as mon_sess:

      global port
      while not mon_sess.should_stop():

        gradients = mon_sess.run(only_gradients,feed_dict = feed_dict)
        # pickling the gradients
        send_data = pickle.dumps(gradients,pickle.HIGHEST_PROTOCOL)
        # finding size of pickled gradients
        to_send_size = len(send_data)
        # Sending the size of the gradients first
        send_size = pickle.dumps(to_send_size, pickle.HIGHEST_PROTOCOL)
        s.sendall(send_size)
        # sending the gradients
        s.sendall(send_data)
        recv_size = safe_recv(17, s)
        recv_size = pickle.loads(recv_size)
        recv_data = safe_recv(recv_size, s)
        gradients2 = pickle.loads(recv_data)
        #print("Recevied gradients of size: ", len(recv_data))
        feed_dict = {}


        for i,grad_var in enumerate(gradients2): 
           feed_dict[placeholder_gradients[i][0]] = gradients2[i]


        res = mon_sess.run(train_op, feed_dict=feed_dict)

但是当我运行这段代码时,工作进程在运行了一些步骤之后神秘地终止(killed)。这是输出:

Connecting to port 16001

Downloading cifar-10-binary.tar.gz 100.0% Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes. WARNING:tensorflow:From
cifar10_train2.py:217: The name tf.gfile.Exists is deprecated. Please
use tf.io.gfile.exists instead.

W1206 14:32:34.514134 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:217: The name tf.gfile.Exists is deprecated. Please
use tf.io.gfile.exists instead.

WARNING:tensorflow:From cifar10_train2.py:218: The name
tf.gfile.DeleteRecursively is deprecated. Please use
tf.io.gfile.rmtree instead.

W1206 14:32:34.515784 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:218: The name tf.gfile.DeleteRecursively is
deprecated. Please use tf.io.gfile.rmtree instead.

WARNING:tensorflow:From cifar10_train2.py:219: The name
tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs
instead.

W1206 14:32:34.521829 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:219: The name tf.gfile.MakeDirs is deprecated.
Please use tf.io.gfile.makedirs instead.

WARNING:tensorflow:From cifar10_train2.py:108:
get_or_create_global_step (from
tensorflow.contrib.framework.python.ops.variables) is deprecated and
will be removed in a future version. Instructions for updating: Please
switch to tf.train.get_or_create_global_step W1206 14:32:35.968658
140128687093568 deprecation.py:323] From cifar10_train2.py:108:
get_or_create_global_step (from
tensorflow.contrib.framework.python.ops.variables) is deprecated and
will be removed in a future version. Instructions for updating: Please
switch to tf.train.get_or_create_global_step WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:346:
string_input_producer (from tensorflow.python.training.input) is
deprecated and will be removed in a future version. Instructions for
updating: Queue-based input pipelines have been replaced by tf.data.
Use
tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit
the .shuffle(...). W1206 14:32:35.973685 140128687093568
deprecation.py:323] From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:346:
string_input_producer (from tensorflow.python.training.input) is
deprecated and will be removed in a future version. Instructions for
updating: Queue-based input pipelines have been replaced by tf.data.
Use
tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit
the .shuffle(...). WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:278:
input_producer (from tensorflow.python.training.input) is deprecated
and will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit
the .shuffle(...). W1206 14:32:35.978900 140128687093568
deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:278:
input_producer (from tensorflow.python.training.input) is deprecated
and will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit
the .shuffle(...). WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:190:
limit_epochs (from tensorflow.python.training.input) is deprecated and
will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.from_tensors(tensor).repeat(num_epochs). W1206
14:32:35.979965 140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:190:
limit_epochs (from tensorflow.python.training.input) is deprecated and
will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.from_tensors(tensor).repeat(num_epochs).
WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199:
QueueRunner.init (from
tensorflow.python.training.queue_runner_impl) is deprecated and will
be removed in a future version. Instructions for updating: To
construct input pipelines, use the tf.data module. W1206
14:32:35.981818 140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199:
QueueRunner.init (from
tensorflow.python.training.queue_runner_impl) is deprecated and will
be removed in a future version. Instructions for updating: To
construct input pipelines, use the tf.data module.
WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199:
add_queue_runner (from tensorflow.python.training.queue_runner_impl)
is deprecated and will be removed in a future version. Instructions
for updating: To construct input pipelines, use the tf.data module.
W1206 14:32:35.983106 140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199:
add_queue_runner (from tensorflow.python.training.queue_runner_impl)
is deprecated and will be removed in a future version. Instructions
for updating: To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:79:
FixedLengthRecordReader.init (from tensorflow.python.ops.io_ops)
is deprecated and will be removed in a future version. Instructions
for updating: Queue-based input pipelines have been replaced by
tf.data. Use tf.data.FixedLengthRecordDataset. W1206
14:32:35.987178 140128687093568 deprecation.py:323] From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:79:
FixedLengthRecordReader.init (from tensorflow.python.ops.io_ops)
is deprecated and will be removed in a future version. Instructions
for updating: Queue-based input pipelines have been replaced by
tf.data. Use tf.data.FixedLengthRecordDataset.
WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:359: The name
tf.random_crop is deprecated. Please use tf.image.random_crop instead.

W1206 14:32:36.005118 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:359: The name
tf.random_crop is deprecated. Please use tf.image.random_crop instead.

WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:1514:
div (from tensorflow.python.ops.math_ops) is deprecated and will be
removed in a future version. Instructions for updating: Deprecated in
favor of operator or tf.math.divide. W1206 14:32:36.057567
140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:1514:
div (from tensorflow.python.ops.math_ops) is deprecated and will be
removed in a future version. Instructions for updating: Deprecated in
favor of operator or tf.math.divide. Filling queue with 20000 CIFAR
images before starting to train. This will take a few minutes.
WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:126:
shuffle_batch (from tensorflow.python.training.input) is deprecated
and will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size). W1206
14:32:36.059510 140128687093568 deprecation.py:323] From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:126:
shuffle_batch (from tensorflow.python.training.input) is deprecated
and will be removed in a future version. Instructions for updating:
Queue-based input pipelines have been replaced by tf.data. Use
tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size).
WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:135: The name
tf.summary.image is deprecated. Please use tf.compat.v1.summary.image
instead.

W1206 14:32:36.075527 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10_input.py:135: The name
tf.summary.image is deprecated. Please use tf.compat.v1.summary.image
instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:374: The name
tf.variable_scope is deprecated. Please use
tf.compat.v1.variable_scope instead.

W1206 14:32:36.079024 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:374: The name
tf.variable_scope is deprecated. Please use
tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:135: calling
TruncatedNormal.init (from tensorflow.python.ops.init_ops) with
dtype is deprecated and will be removed in a future version.
Instructions for updating: Call initializer instance with the dtype
argument instead of passing it to the constructor W1206
14:32:36.079606 140128687093568 deprecation.py:506] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:135: calling
TruncatedNormal.init (from tensorflow.python.ops.init_ops) with
dtype is deprecated and will be removed in a future version.
Instructions for updating: Call initializer instance with the dtype
argument instead of passing it to the constructor
WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:111: The name
tf.get_variable is deprecated. Please use tf.compat.v1.get_variable
instead.

W1206 14:32:36.080104 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:111: The name
tf.get_variable is deprecated. Please use tf.compat.v1.get_variable
instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:138: The name
tf.add_to_collection is deprecated. Please use
tf.compat.v1.add_to_collection instead.

W1206 14:32:36.089883 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:138: The name
tf.add_to_collection is deprecated. Please use
tf.compat.v1.add_to_collection instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:386: The name
tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

W1206 14:32:36.096186 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:386: The name
tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:572: The name
tf.train.exponential_decay is deprecated. Please use
tf.compat.v1.train.exponential_decay instead.

W1206 14:32:36.164064 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:572: The name
tf.train.exponential_decay is deprecated. Please use
tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:577: The name
tf.summary.scalar is deprecated. Please use
tf.compat.v1.summary.scalar instead.

W1206 14:32:36.170583 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:577: The name
tf.summary.scalar is deprecated. Please use
tf.compat.v1.summary.scalar instead.

INFO:tensorflow:Summary name conv1/weight_loss (raw) is illegal; using
conv1/weight_loss__raw_ instead. I1206 14:32:36.224623 140128687093568
summary_op_util.py:66] Summary name conv1/weight_loss (raw) is
illegal; using conv1/weight_loss__raw_ instead.
INFO:tensorflow:Summary name conv2/weight_loss (raw) is illegal; using
conv2/weight_loss__raw_ instead. I1206 14:32:36.230782 140128687093568
summary_op_util.py:66] Summary name conv2/weight_loss (raw) is
illegal; using conv2/weight_loss__raw_ instead.
INFO:tensorflow:Summary name local3/weight_loss (raw) is illegal;
using local3/weight_loss__raw_ instead. I1206 14:32:36.234036
140128687093568 summary_op_util.py:66] Summary name local3/weight_loss
(raw) is illegal; using local3/weight_loss__raw_ instead.
INFO:tensorflow:Summary name local4/weight_loss (raw) is illegal;
using local4/weight_loss__raw_ instead. I1206 14:32:36.236655
140128687093568 summary_op_util.py:66] Summary name local4/weight_loss
(raw) is illegal; using local4/weight_loss__raw_ instead.
INFO:tensorflow:Summary name softmax_linear/weight_loss (raw) is
illegal; using softmax_linear/weight_loss__raw_ instead. I1206
14:32:36.239247 140128687093568 summary_op_util.py:66] Summary name
softmax_linear/weight_loss (raw) is illegal; using
softmax_linear/weight_loss__raw_ instead. INFO:tensorflow:Summary name
cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.
I1206 14:32:36.241847 140128687093568 summary_op_util.py:66] Summary
name cross_entropy (raw) is illegal; using cross_entropy__raw_
instead. INFO:tensorflow:Summary name total_loss (raw) is illegal;
using total_loss__raw_ instead. I1206 14:32:36.244410 140128687093568
summary_op_util.py:66] Summary name total_loss (raw) is illegal; using
total_loss__raw_ instead. WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:584: The name
tf.train.GradientDescentOptimizer is deprecated. Please use
tf.compat.v1.train.GradientDescentOptimizer instead.

W1206 14:32:36.247661 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:584: The name
tf.train.GradientDescentOptimizer is deprecated. Please use
tf.compat.v1.train.GradientDescentOptimizer instead.

WARNING:tensorflow:From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:609: The name
tf.summary.histogram is deprecated. Please use
tf.compat.v1.summary.histogram instead.

W1206 14:32:36.383297 140128687093568 deprecation_wrapper.py:119] From
/home/ubuntu/aws_share/DistNet-master/cifar10.py:609: The name
tf.summary.histogram is deprecated. Please use
tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py:433:
Variable.initialized_value (from tensorflow.python.ops.variables) is
deprecated and will be removed in a future version. Instructions for
updating: Use Variable.read_value. Variables in 2.X are initialized
automatically both in eager and graph (inside tf.defun) contexts.
W1206 14:32:36.406855 140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py:433:
Variable.initialized_value (from tensorflow.python.ops.variables) is
deprecated and will be removed in a future version. Instructions for
updating: Use Variable.read_value. Variables in 2.X are initialized
automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From cifar10_train2.py:144: The name
tf.train.SessionRunHook is deprecated. Please use
tf.estimator.SessionRunHook instead.

W1206 14:32:36.700819 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:144: The name tf.train.SessionRunHook is deprecated.
Please use tf.estimator.SessionRunHook instead.

WARNING:tensorflow:From cifar10_train2.py:171: The name
tf.train.MonitoredTrainingSession is deprecated. Please use
tf.compat.v1.train.MonitoredTrainingSession instead.

W1206 14:32:36.701856 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:171: The name tf.train.MonitoredTrainingSession is
deprecated. Please use tf.compat.v1.train.MonitoredTrainingSession
instead.

WARNING:tensorflow:From cifar10_train2.py:173: The name
tf.train.StopAtStepHook is deprecated. Please use
tf.estimator.StopAtStepHook instead.

W1206 14:32:36.702125 140128687093568 deprecation_wrapper.py:119] From
cifar10_train2.py:173: The name tf.train.StopAtStepHook is deprecated.
Please use tf.estimator.StopAtStepHook instead.

INFO:tensorflow:Create CheckpointSaverHook. I1206 14:32:36.702475
140128687093568 basic_session_run_hooks.py:541] Create
CheckpointSaverHook. WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354:
add_dispatch_support..wrapper (from
tensorflow.python.ops.array_ops) is deprecated and will be removed in
a future version. Instructions for updating: Use tf.where in 2.0,
which has the same broadcast rule as np.where W1206 14:32:36.849380
140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354:
add_dispatch_support..wrapper (from
tensorflow.python.ops.array_ops) is deprecated and will be removed in
a future version. Instructions for updating: Use tf.where in 2.0,
which has the same broadcast rule as np.where INFO:tensorflow:Graph
was finalized. I1206 14:32:36.947587 140128687093568
monitored_session.py:240] Graph was finalized. 2019-12-06
14:32:36.948710: I tensorflow/core/platform/cpu_feature_guard.cc:142]
Your CPU supports instructions that this TensorFlow binary was not
compiled to use: AVX2 FMA 2019-12-06 14:32:36.957928: I
tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
2400045000 Hz 2019-12-06 14:32:36.958092: I
tensorflow/compiler/xla/service/service.cc:168] XLA service 0x405e200
executing computations on platform Host. Devices: 2019-12-06
14:32:36.958116: I tensorflow/compiler/xla/service/service.cc:175]
StreamExecutor device (0): , 2019-12-06
14:32:37.060010: W
tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time
warning): Not using XLA:CPU for cluster because envvar
TF_XLA_FLAGS=–tf_xla_cpu_global_jit was not set. If you want
XLA:CPU, either set that envvar, or use experimental_jit_scope to
enable XLA:CPU. To confirm that XLA is active, pass
–vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=–xla_hlo_profile.
INFO:tensorflow:Running local_init_op. I1206 14:32:37.153052
140128687093568 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op. I1206 14:32:37.165080
140128687093568 session_manager.py:502] Done running local_init_op.
WARNING:tensorflow:From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:875: start_queue_runners (from
tensorflow.python.training.queue_runner_impl) is deprecated and will
be removed in a future version. Instructions for updating: To
construct input pipelines, use the tf.data module. W1206
14:32:37.200688 140128687093568 deprecation.py:323] From
/home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:875: start_queue_runners (from
tensorflow.python.training.queue_runner_impl) is deprecated and will
be removed in a future version. Instructions for updating: To
construct input pipelines, use the tf.data module.
INFO:tensorflow:Saving checkpoints for 0 into
/home/ubuntu/cifar10_train/model.ckpt. I1206 14:32:38.378327
140128687093568 basic_session_run_hooks.py:606] Saving checkpoints for
0 into /home/ubuntu/cifar10_train/model.ckpt. 2019-12-06
14:32:42.361013: W tensorflow/core/framework/allocator.cc:107]
Allocation of 18874368 exceeds 10% of system memory. 2019-12-06
14:32:42.897646: W tensorflow/core/framework/allocator.cc:107]
Allocation of 21196800 exceeds 10% of system memory. 2019-12-06
14:32:43.161692: W tensorflow/core/framework/allocator.cc:107]
Allocation of 9437184 exceeds 10% of system memory. 2019-12-06
14:32:43.169073: W tensorflow/core/framework/allocator.cc:107]
Allocation of 18874368 exceeds 10% of system memory. 2019-12-06
14:32:43.231100: W tensorflow/core/framework/allocator.cc:107]
Allocation of 16070400 exceeds 10% of system memory. 2019-12-06
14:32:43.285419: step 0, loss = 4.67 (197.6 examples/sec; 0.648
sec/batch) WARNING:tensorflow:It seems that global step
(tf.train.get_global_step) has not been increased. Current value
(could be stable): 0 vs previous value: 0. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. W1206 14:32:43.988662
140128687093568 basic_session_run_hooks.py:724] It seems that global
step (tf.train.get_global_step) has not been increased. Current value
(could be stable): 0 vs previous value: 0. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. WARNING:tensorflow:It
seems that global step (tf.train.get_global_step) has not been
increased. Current value (could be stable): 1 vs previous value: 1.
You could increase the global step by passing
tf.train.get_global_step() to Optimizer.apply_gradients or
Optimizer.minimize. W1206 14:32:45.315430 140128687093568
basic_session_run_hooks.py:724] It seems that global step
(tf.train.get_global_step) has not been increased. Current value
(could be stable): 1 vs previous value: 1. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. WARNING:tensorflow:It
seems that global step (tf.train.get_global_step) has not been
increased. Current value (could be stable): 2 vs previous value: 2.
You could increase the global step by passing
tf.train.get_global_step() to Optimizer.apply_gradients or
Optimizer.minimize. W1206 14:32:46.506114 140128687093568
basic_session_run_hooks.py:724] It seems that global step
(tf.train.get_global_step) has not been increased. Current value
(could be stable): 2 vs previous value: 2. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. WARNING:tensorflow:It
seems that global step (tf.train.get_global_step) has not been
increased. Current value (could be stable): 3 vs previous value: 3.
You could increase the global step by passing
tf.train.get_global_step() to Optimizer.apply_gradients or
Optimizer.minimize. W1206 14:32:47.667660 140128687093568
basic_session_run_hooks.py:724] It seems that global step
(tf.train.get_global_step) has not been increased. Current value
(could be stable): 3 vs previous value: 3. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. WARNING:tensorflow:It
seems that global step (tf.train.get_global_step) has not been
increased. Current value (could be stable): 4 vs previous value: 4.
You could increase the global step by passing
tf.train.get_global_step() to Optimizer.apply_gradients or
Optimizer.minimize. W1206 14:32:48.857112 140128687093568
basic_session_run_hooks.py:724] It seems that global step
(tf.train.get_global_step) has not been increased. Current value
(could be stable): 4 vs previous value: 4. You could increase the
global step by passing tf.train.get_global_step() to
Optimizer.apply_gradients or Optimizer.minimize. 2019-12-06
14:32:50.199463: step 10, loss = 4.66 (185.1 examples/sec; 0.691
sec/batch) 2019-12-06 14:32:56.430896: step 20, loss = 4.64 (205.5
examples/sec; 0.623 sec/batch) Killed

搜索解决方案并找到以下命令:

dmesg -T| grep -E -i -B100 'killed process'

它显示了导致进程终止并获得以下输出的原因:

[Fri Dec 6 14:33:06 2019] [ pid ] uid tgid total_vm rss
pgtables_bytes swapents oom_score_adj name [Fri Dec 6 14:33:06 2019]
[ 8384] 1000 8384 64817 627 266240 0 0
(sd-pam) [Fri Dec 6 14:33:06 2019] [ 8513] 1000 8513 27163
427 253952 0 0 sshd [Fri Dec 6 14:33:06 2019] [
8519] 1000 8519 5803 437 86016 0 0
bash [Fri Dec 6 14:33:06 2019] [ 8558] 1000 8558 118072 11795
430080 0 0 jupyter-noteboo [Fri Dec 6 14:33:06
2019] [ 8567] 1000 8567 155530 8753 421888 0
0 python3 [Fri Dec 6 14:33:06 2019] [ 8588] 1000 8588 156391
9638 430080 0 0 python3 [Fri Dec 6 14:33:06
2019] [ 8826] 1000 8826 1157 16 49152 0
0 sh [Fri Dec 6 14:33:06 2019] [ 8827] 1000 8827 462846 175833
2416640 0 0 cifar10_train2. [Fri Dec 6 14:33:06
2019] Out of memory: Kill process 8827 (cifar10_train2.) score 700 or
sacrifice child [Fri Dec 6 14:33:06 2019] Killed process 8827
(cifar10_train2.) total-vm:1851384kB, anon-rss:703332kB, file-rss:0kB,
shmem-rss:0kB

这意味着内存不足的原因导致!

解决

[cRPD] Syslog Message: ‘Memory cgroup out of memory: Kill process (rpd) score or sacrifice child’

RPD 在启动容器时将利用比分配给它的内存+交换更多的内存+交换。例如,在以下情况下,生成的容器的内存+交换限制为 2G

docker run --rm -detach --name cRR1 -h cRR1 --privileged -v cRR1-config:/config -v cRR1-Log:/var/log  --memory="2g" --memory-swap="2g" -it crpd:20.3X75-D5.5 

如下面的 docker 统计信息输出所示,它显示容器正在利用分配给它的所有内存 - 2G

/home/regress# ps aux | grep "/usr/sbin/rpd -N"

在这里插入图片描述

任务内存输出显示容器在“22/05/30 09:56:20”上使用了大约 2G 的最大内存 - 这正是内存不足和 RPD 崩溃的时候

Memory cgroup out of memory” is a message coming from the kernel when it kills something because of a memory cgroup restriction, such as those we place on containers while spinning them.
“内存 cgroup 出内存”是来自内核的消息,当它由于内存 cgroup 限制而杀死某些内容时,例如我们在旋转容器时放置在容器上的那些。
One major thing that could lead to this event is if insufficient memory is allocated to container which is insufficient to handle the route scale.
可能导致此事件的一个主要原因是,如果分配给容器的内存不足,这不足以处理路由规模。
Another factor which could lead to memory contention is a huge route churn which pushes the memory to the edge of extinction.
另一个可能导致内存争用的因素是巨大的路由搅动,它将内存推向灭绝的边缘。
As a reference for RIB/FIB scale vs minimum memory required to support it, refer the documentation on cRPD Resource Requirements.
有关 RIB/FIB 规模与支持它所需的最小内存的参考,请参阅 cRPD 资源要求的文档。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/839498.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

不规则文件转JSON

需求分析: 有时候,我们取出来的数据并不是一个规则的JSON文件,这个时候面对存库还是ES检索都是一个问题,所以我们就需要进行解析,然而用字符串分割是不现实的,我们需要一种快速的方法。 问题解决&#x…

设计模式行为型——备忘录模式

目录 什么是备忘录模式 备忘录模式的实现 备忘录模式角色 备忘录模式类图 备忘录模式举例 备忘录模式代码实现 备忘录模式的特点 优点 缺点 使用场景 注意事项 实际应用 什么是备忘录模式 备忘录模式(Memento Pattern)又叫做快照模式&#x…

ant.design 组件库中的 Tree 组件实现可搜索的树: React+and+ts

ant.design 组件库中的 Tree 组件实现可搜索的树,在这里我会详细介绍每个方法,以及容易踩坑的点。 效果图: 首先是要导入的文件 // React 自带的属性 import React, { useMemo, useState } from react; // antd 组件库中的,输入…

VSCode插件Todo Tree的使用

在VSCode中安装插件Todo Tree。按下快捷键ctrlshiftP,输入setting.jspn,选择相应的配置范围,我们选择的是用户配置 Open User Settings(JSON),将以下代码插入其中。 //todo-tree 标签配置从这里开始 标签兼容大小写字母(很好的功…

JAVA方向的大数据包含啥内容?

文章目录 大数据是啥大数据就业方向知识体系HadoophiveHBaseSparkScala 总结 大数据是啥 你了解到的大数据是啥样子? 还是… 大数据(big data),或称巨量资料,指的是所涉及的资料量规模巨大到无法透过主流软件工具,在合理时间…

麦肯锡战略思维四大原则

麦肯锡战略思维四大原则 曾任职麦肯锡、安永等国家国际知名咨询机构的周正元,在其著作《麦肯锡结构化战略思维》将其系统的整理呈现出来,便于学习和使用。 模型介绍 工作中的你,是不是经常遇到复杂问题,六神无主? 专业…

SpringBoot+SSM实战<一>:打造高效便捷的企业级Java外卖订购系统

文章目录 项目简介项目架构功能模块管理端用户端 技术选型用户层网关层应用层数据层工具 项目优缺点结语 黑马程序员最新Java项目实战《苍穹外卖》:让你轻松掌握SpringBootSSM的企业级开发技巧项目简介 《苍穹外卖》是一款为餐饮企业(餐厅、饭店&#x…

PBR材质理解整理

PBR Material 草履虫都能看懂的PBR讲解(迫真) 先前看了很多遍类似的了,结合《Unity Shader 入门精要》中的内容整理了下便于以后理解,以后有补充再添加。 光与材质相交会发生散射和吸收,散射改变光的方向&#xff0c…

MySQL主从复制——概念、原理、搭建过程

文章目录 1.主从复制概念2.主从复制原理3.主从复制结构的搭建3.1 主库配置3.2 从库配置 4.测试主从复制是否搭建成功5.主从复制的小结 DML(data manipulation language)是数据操纵语言:它们是SELECT、UPDATE、INSERT、DELETE,就象…

LeetCode 热题 100 JavaScript--108. 将有序数组转换为二叉搜索树

给你一个整数数组 nums &#xff0c;其中元素已经按 升序 排列&#xff0c;请你将其转换为一棵 高度平衡 二叉搜索树。 高度平衡 二叉树是一棵满足「每个节点的左右两个子树的高度差的绝对值不超过 1 」的二叉树。 提示&#xff1a; 1 < nums.length < 104 -104 < n…

怎么维护好自己的电脑

你的电脑已经成为你工作、学习、娱乐的最佳工具之一&#xff0c;但是如果你不做好电脑维护工作&#xff0c;就可能面临着电脑变慢、蓝屏、崩溃等问题。在这篇文章中&#xff0c;我们将介绍10个电脑维护步骤&#xff0c;让你的电脑更加稳定&#xff01; 为什么需要电脑维护&…

【Kubernetes部署篇】基于Ubuntu20.04操作系统搭建K8S1.23版本集群

文章目录 一、集群架构规划信息二、系统初始化准备(所有节点同步操作)三、安装kubeadm(所有节点同步操作)四、初始化K8S集群(master节点操作)五、添加Node节点到K8S集群中六、安装Calico网络插件七、测试CoreDNS可用性 一、集群架构规划信息 pod网段&#xff1a;10.244.0.0/16…

我的创作纪念日-三周年

机缘 工作很长时间之后&#xff0c;才发现知识的根本&#xff0c;还是在于积累。俗话说好记性不如烂笔头。不管是特定产品相关的知识还是系统类的知识&#xff0c;又或者是语言类的知识&#xff0c;都有很多知识点需要积累。有了积累之后&#xff0c;工作起来才游刃有余。 原来…

掌握Memory Profiler技巧:识别内存问题

关于作者&#xff1a;CSDN内容合伙人、技术专家&#xff0c; 从零开始做日活千万级APP。 专注于分享各领域原创系列文章 &#xff0c;擅长java后端、移动开发、人工智能等&#xff0c;希望大家多多支持。 目录 一、导读二、概览三、如何使用四、页面说明4.1 Java 和 Kotlin 分配…

snap xxx has “install-snap“ change in progress

error description * 系重复安装&#xff0c;进程冲突 solution 展示snap的改变 然后sudo snap abort 22即可终止该进程 之后重新运行install command&#xff5e;&#xff5e; PS: ubuntu有时候加载不出来&#xff0c;执行resolvectl flush-caches&#xff0c;清除dns缓存…

Red Hat 安装JDK与IntelliJ IDEA

目录 前言 Red Hat 安装 JDK 1、更新软件包列表 2、安装OpenJDK 3、验证安装 Red Hat 安装IntelliJ IDEA 1、下载 IntelliJ IDEA 2、解压缩 IntelliJ IDEA 安装包 3、移动 IntelliJ IDEA 到安装目录 4、启动 IntelliJ IDEA 前言 YUM是基于Red Hat的Linux发行版的一个…

3.PyCharm安装

PyCharm是由JetBrains推出的Python开发IDE,是最受欢迎的Python IDE之一。PyCharm为Python开发者提供了许多高级功能如代码自动完成、调试等。它使用智能引擎来分析代码,能够自动识别代码中的错误并提供快速修复方案。PyCharm适用于各种规模的项目,包括小型Python脚本和大型P…

PHP8的程序结构-PHP8知识详解

在做任何事情之前&#xff0c;都需要遵循一定的规则。在PHP8中&#xff0c;程序能够安照人们的意愿执行程序&#xff0c;主要依靠程序的流程控制语句。 不管多复杂的程序&#xff0c;都是由这些基本的语句组成的。语句是构造程序的基本单位。程序执行的过程就是执行程序语句的…

【C++】类和对象 - 下

目录 1. 再谈构造函数1.1 构造函数体赋值1.2 初始化列表1.3 explicit关键字 2. static成员2.1 概念2.2 特性 3. 友元3.1 友元函数3.2 友元类 4. 内部类5. 匿名对象6. 拷贝对象时的一些编译器优化 1. 再谈构造函数 1.1 构造函数体赋值 在创建对象时&#xff0c;编译器通过调用…

【供电并接电路】2021-12-31

缘由在实际应用中mos管的导通问题 - 电路设计论坛 - 电子技术论坛 - 广受欢迎的专业电子论坛!MOS管实际应用中的开通问题-硬件开发-CSDN问答 设计的思想是&#xff0c;当CH_5.2V有5.2V电压时&#xff0c;Q7导通&#xff0c;Q6产生UGS​压差&#xff0c;从而使Q6导通&#xff0c…