
Inconsistent results when predicting a single sentence versus predicting labels for dev set #30

Closed
Alaska47 opened this issue Jun 25, 2019 · 4 comments


@Alaska47

Alaska47 commented Jun 25, 2019

I'm currently trying to modify the code to predict labels for an arbitrary sentence (not part of the train, dev, or test set). I was getting weird behavior when looking at the output: the padding tokens had their own labels, and some of the tokens were not being labeled correctly.

The discrepancy existed even when I tried to predict the labels for one of the sentences from my dev set. I was able to get the predicted labels for the dev set using the code from #16. When I compared the labels predicted over the dev set to a single-sentence prediction of the same sentence, some of the labels were off.

I decided to look at the feature vectors generated in the preprocessing stage for both cases, and surprisingly the vectors were almost exactly the same (differing only in trailing padding).
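For reference, I checked this roughly as follows (a sketch; strip_trailing_pad is my own helper, and dev_tokens / single_tokens are toy stand-ins for the actual vectors dumped below):

import numpy as np

def strip_trailing_pad(x, pad_id=0):
    # drop trailing pad ids so vectors with different padded lengths compare equal
    nonzero = np.nonzero(x != pad_id)[0]
    return x[: nonzero[-1] + 1] if nonzero.size else x[:0]

dev_tokens = np.array([0, 18152, 1279, 5, 0, 0, 0, 0])  # toy stand-in
single_tokens = np.array([0, 18152, 1279, 5, 0])        # toy stand-in
print(np.array_equal(strip_trailing_pad(dev_tokens),
                     strip_trailing_pad(single_tokens)))  # True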

token vector:

sentence from dev set
[     0  18152  18152  18152  18152   1279  12659   3128      4      3
    338      7  18152  18152  18152  18152 243801      6   4809   7282
  50959     11  31481   1020  89238      6  33746      7  10587   2578
    813 243021     40   1020     11     60      7     92      4     40
     10   4704   9889      4    106     21  12404    334      7   7746
      7   1296   3995      4   2486   4816      7  10587      4     11
      3    220      7     10  59651      5      0      0      0      0
      0]

single sentence:
[     0  18152  18152  18152  18152   1279  12659   3128      4      3
    338      7  18152  18152  18152  18152 243801      6   4809   7282
  50959     11  31481   1020  89238      6  33746      7  10587   2578
    813 243021     40   1020     11     60      7     92      4     40
     10   4704   9889      4    106     21  12404    334      7   7746
      7   1296   3995      4   2486   4816      7  10587      4     11
      3    220      7     10  59651      5      0]

char vector:

sentence from dev set
[ 0 29 29 29 31 29 27 28 30 24 24 26 29 27 31 31 28 45  6  7 15  5  7 15
  5  3  4 12 18 15 20  7  4 16 20 15  5 36  4  6  7  3 20  7  3 18 34 29
 31 30 31 23 23 23 24 30 33 26 25 23 25 25 26 12 15  5  7 15  5 12 35 12
 17 12  4 38  4 18  8  3 12 15  5  4  7  3  9 12 17 38  9 12 14 12 15 12
  5  6  7  5  3 15  9  9 12  5  3  8  8  7  3 20  5 40  5  7 15  5 12 35
 12 17 12  4 38  4 18  7 37  4 20  7 14  7  5 18 34  4  7 14  8  7 20  3
  4 16 20  7  3  8  8  7  3 20  5  5 18 18 15 31 24 27 21 27 31 23 21 23
 30 21 31 33 33  3 34  4  7 20 40  3 15  9 17  3  5  4 18 34  3 17 17 36
  3 34  4  7 20  3 32 18 15  5 12  9  7 20  3 35 17  7 12 15  4  7 20 44
  3 17 36  4  6  7 20  7 12  5  5 12 14 16 17  4  3 15  7 18 16  5 20  7
  4 16 20 15 18 34  3  8  8 20  7 32 12  3  4 12 18 15 18 34 17 12 39  6
  4  4 18 16 32  6 36 14 18  9  7 20  3  4  7  9  7 39 20  7  7  5 18 34
  4  7 14  8  7 20  3  4 16 20  7 36  3 15  9  4  6  7  8 18 12 15  4  5
 18 34  3 32 18 14  8  3  5  5 21  0]

single sentence:
[ 0 29 29 29 31 29 27 28 30 24 24 26 29 27 31 31 28 45  6  7 15  5  7 15
  5  3  4 12 18 15 20  7  4 16 20 15  5 36  4  6  7  3 20  7  3 18 34 29
 31 30 31 23 23 23 24 30 33 26 25 23 25 25 26 12 15  5  7 15  5 12 35 12
 17 12  4 38  4 18  8  3 12 15  5  4  7  3  9 12 17 38  9 12 14 12 15 12
  5  6  7  5  3 15  9  9 12  5  3  8  8  7  3 20  5 40  5  7 15  5 12 35
 12 17 12  4 38  4 18  7 37  4 20  7 14  7  5 18 34  4  7 14  8  7 20  3
  4 16 20  7  3  8  8  7  3 20  5  5 18 18 15 31 24 27 21 27 31 23 21 23
 30 21 31 33 33  3 34  4  7 20 40  3 15  9 17  3  5  4 18 34  3 17 17 36
  3 34  4  7 20  3 32 18 15  5 12  9  7 20  3 35 17  7 12 15  4  7 20 44
  3 17 36  4  6  7 20  7 12  5  5 12 14 16 17  4  3 15  7 18 16  5 20  7
  4 16 20 15 18 34  3  8  8 20  7 32 12  3  4 12 18 15 18 34 17 12 39  6
  4  4 18 16 32  6 36 14 18  9  7 20  3  4  7  9  7 39 20  7  7  5 18 34
  4  7 14  8  7 20  3  4 16 20  7 36  3 15  9  4  6  7  8 18 12 15  4  5
 18 34  3 32 18 14  8  3  5  5 21  0]

seq len vector:

sentence from dev set
[65]
single sentence:
[65]

tok len vector:

sentence from dev set
[ 1  4  4  4  4  4  9  7  1  3  4  2  4  4  4  4 13  2  4  8 10  3 10  1
 11  2  8  2 11  7  4 14  5  1  3  4  2  3  1  5  1 12  8  1  5  2 12  6
  2 12  2  5  5  1  8  7  2 11  1  3  3  6  2  1  7  1  1  0  0  0  0]

single sentence:
[ 1  4  4  4  4  4  9  7  1  3  4  2  4  4  4  4 13  2  4  8 10  3 10  1
 11  2  8  2 11  7  4 14  5  1  3  4  2  3  1  5  1 12  8  1  5  2 12  6
  2 12  2  5  5  1  8  7  2 11  1  3  3  6  2  1  7  1  1]

Predictions:

sentence from dev set
[3, 4, 4, 5, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

single sentence (without the padding tokens):
[3, 4, 4, 5, 8, 16, 6, 8, 6, 0, 0, 3, 4, 4, 4, 5, 8, 14, 8, 0, 6, 0, 8, 0, 8, 0, 6, 0, 6, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 8, 8, 8, 6, 6, 6, 6, 8, 0, 8, 0, 6, 8, 6, 6, 6, 6, 0, 0]

You can see that some of the labels agree (i.e., 3, 4, 5 form a B/I/L group for one class, and 10 is a U tag for another). However, the single-sentence prediction contains a bunch of additional predicted classes for the exact same sentence with virtually the same feature vectors.

I'm wondering if the model maintains some internal state with context, which would explain why the results from running on the dev set differ from a single-sentence prediction. I didn't enable the documents flag, so I assumed each sentence would be treated separately. If that's not the case, how can I train the model to treat each sentence separately?

If the model maintains no internal state, could padding (the only difference between the feature vectors) play a large role in the predicted classes? If so, how should I be padding the feature vectors for single-sentence prediction? Right now I just pad each feature vector to the maximum length in the batch, but for a single sentence the batch size is 1, so effectively no extra padding is added.
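One experiment I can run to test the padding hypothesis: re-pad the single-sentence vectors out to the dev batch's padded length and see whether the labels change. A sketch with toy values (pad_to and the lengths here are hypothetical):

import numpy as np

def pad_to(x, target_len, pad_id=0):
    # right-pad a 1-D feature vector with pad_id up to target_len
    return np.pad(x, (0, target_len - x.shape[0]),
                  mode='constant', constant_values=pad_id)

single_tokens = np.arange(1, 8)   # toy stand-in, length 7
dev_batch_seq_len = 11            # padded length used by the dev batch
print(pad_to(single_tokens, dev_batch_seq_len))
# [ 1  2  3  4  5  6  7  0  0  0  0]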

Thanks in advance.

@strubell
Member

strubell commented Jun 25, 2019 via email

@Alaska47
Author

No, it happens when I try to dynamically generate the feature vectors for a single sentence. To clarify, running the model on an example from the dev set using a batch size of 1 still gives me accurate labels. From this, I'm guessing that there's no internal state maintained in the model, so the results should be deterministic.

However, when I try to dynamically generate the vectors for a single sentence, I'm getting inaccurate labels even though the input vectors in both cases are virtually the same. I'm going to run a couple of tests to see if using the same padding changes the result at all. I'll also post my code below in case you want to take a look at it.
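For the internal-state question, the test I have in mind is roughly this (a sketch; sess, model, and total_feeds are the objects from run_prediction below):

import numpy as np

# run the exact same feeds twice; identical outputs would rule out
# internal state or randomness at inference time
preds_a = sess.run(model.predictions, feed_dict=total_feeds)
preds_b = sess.run(model.predictions, feed_dict=total_feeds)
print(np.array_equal(preds_a, preds_b))  # expect True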

Most of the code is pieced together from your existing methods, just modified to work standalone on a single sentence.

Preprocessing

import re

import numpy as np
from nltk.tokenize import word_tokenize  # assuming nltk's tokenizer here

# PAD_STR, OOV_STR, and shape() are the constants/helpers from the
# repo's preprocessing code

def single_sentence_preprocess(sentence, token_map, shape_map, char_map, token_int_str_map, shape_int_str_map, char_int_str_map):
    update_vocab = True
    update_chars = True
    pad_width = 1

    lines = word_tokenize(sentence)

    sent_len = len(lines)
    if sent_len == 0:
        # bail out before max() below, which fails on an empty list;
        # the batch function skips None results
        return None

    max_word_len = max(len(word) for word in lines)
    max_len_with_pad = 2 + sent_len
    oov_count = 0

    tokens = np.zeros(max_len_with_pad, dtype=np.int64)
    shapes = np.zeros(max_len_with_pad, dtype=np.int64)
    chars = np.zeros(max_len_with_pad*max_word_len, dtype=np.int64)
    sent_lens = []
    tok_lens = []

    tokens[:pad_width] = token_map[PAD_STR]
    shapes[:pad_width] = shape_map[PAD_STR]
    chars[:pad_width] = char_map[PAD_STR]

    tok_lens.extend([1]*pad_width)
    current_sent_len = 0
    char_start = pad_width
    idx = pad_width

    for token_str in lines:
        current_sent_len += 1
        token_shape = shape(token_str)
        token_str_normalized = re.sub(r"\d", "0", token_str)  # digits -> 0, matching training preprocessing
        if token_shape not in shape_map:
            shape_map[token_shape] = len(shape_map)
            print("LOGGING: shape map updated")

        for char in token_str:
            if char not in char_map and update_chars:
                char_map[char] = len(char_map)
                char_int_str_map[char_map[char]] = char
                print("LOGGING: char map updated")
        tok_lens.append(len(token_str))

        if token_str_normalized not in token_map:
            oov_count += 1
            if update_vocab:
                print("LOGGING: token map updated with " + token_str_normalized)
                token_map[token_str_normalized] = len(token_map)
                token_int_str_map[token_map[token_str_normalized]] = token_str_normalized

        tokens[idx] = token_map.get(token_str_normalized, token_map[OOV_STR])
        shapes[idx] = shape_map[token_shape]
        chars[char_start:char_start+tok_lens[-1]] = [char_map.get(char, char_map[OOV_STR]) for char in token_str]
        char_start += tok_lens[-1]
        idx += 1

    sent_lens.append(current_sent_len)
    current_sent_len = 0
    tokens[idx:idx + pad_width] = token_map[PAD_STR]
    shapes[idx:idx + pad_width] = shape_map[PAD_STR]
    chars[char_start:char_start + pad_width] = char_map[PAD_STR]
    char_start += pad_width
    tok_lens.extend([1] * pad_width)
    idx += pad_width

    padded_len = (len(sent_lens) + 1) * pad_width + sum(sent_lens)
    tokens = tokens[:padded_len]
    shapes = shapes[:padded_len]
    chars = chars[:sum(tok_lens)]

    return tokens, shapes, chars, np.asarray(sent_lens), np.asarray(tok_lens)

def pad_to_length(x, m):
    return np.pad(x,(0, m - x.shape[0]), mode = 'constant')

def batch_sentence_preprocess(sentences, token_map, shape_map, char_map, token_int_str_map, shape_int_str_map, char_int_str_map):
    # normalize token, shape, char, and tok_len vectors to a common length across the batch
    pad_width = 1  # must match the pad_width used in single_sentence_preprocess
    batch_members = []

    for sentence in sentences:
        features = single_sentence_preprocess(sentence, token_map, shape_map, char_map, token_int_str_map, shape_int_str_map, char_int_str_map)
        if features is None:  # empty line in the input; skip it
            continue
        batch_members.append(features)

    max_token_len = max([len(batch_member[0]) for batch_member in batch_members])
    max_shape_len = max([len(batch_member[1]) for batch_member in batch_members])
    max_char_len = max([len(batch_member[2]) for batch_member in batch_members])
    max_token_tok_len = max([len(batch_member[4]) for batch_member in batch_members])

    eval_token_batch = np.asarray([pad_to_length(x[0], max_token_len) for x in batch_members])
    eval_shape_batch = np.asarray([pad_to_length(x[1], max_shape_len) for x in batch_members])
    eval_char_batch = np.asarray([pad_to_length(x[2], max_char_len) for x in batch_members])
    eval_tok_len_batch = np.asarray([pad_to_length(x[4], max_token_tok_len) for x in batch_members])
    eval_seq_len_batch = np.asarray([x[3] for x in batch_members])

    mask_batch = np.zeros(eval_token_batch.shape)
    actual_seq_lens = np.add(np.sum(eval_seq_len_batch, axis=1), 1 * pad_width * ((eval_seq_len_batch != 0).sum(axis=1) + 1))
    for i, seq_len in enumerate(actual_seq_lens):
        mask_batch[i, :seq_len] = 1

    return eval_token_batch, eval_shape_batch, eval_char_batch, eval_tok_len_batch, eval_seq_len_batch, mask_batch
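To double-check the mask logic above, here's a toy illustration with made-up numbers (one sentence of 3 tokens, padded batch length 7, pad_width = 1 on each side):

import numpy as np

pad_width = 1
eval_seq_len_batch = np.array([[3]])   # one sentence, 3 tokens
eval_token_batch = np.zeros((1, 7))    # padded to length 7

mask_batch = np.zeros(eval_token_batch.shape)
# real tokens plus one pad token on each side of each sentence
actual_seq_lens = np.add(np.sum(eval_seq_len_batch, axis=1),
                         pad_width * ((eval_seq_len_batch != 0).sum(axis=1) + 1))
for i, seq_len in enumerate(actual_seq_lens):
    mask_batch[i, :seq_len] = 1
print(mask_batch)  # [[1. 1. 1. 1. 1. 0. 0.]]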

Predictions

def run_prediction(eval_batches, extra_text=""):
    predictions = []
    eval_token_batch, eval_shape_batch, eval_char_batch, eval_seq_len_batch, eval_tok_len_batch, eval_mask_batch = eval_batches

    batch_size, batch_seq_len = eval_token_batch.shape
    char_lens = np.sum(eval_tok_len_batch, axis=1)
    max_char_len = np.max(eval_tok_len_batch)
    eval_padded_char_batch = np.zeros((batch_size, max_char_len * batch_seq_len))
    for b in range(batch_size):
        char_indices = [item for sublist in [range(i * max_char_len, i * max_char_len + d) for i, d in
                                             enumerate(eval_tok_len_batch[b])] for item in sublist]
        eval_padded_char_batch[b, char_indices] = eval_char_batch[b][:char_lens[b]]

    char_embedding_feeds = {
        char_embedding_model.input_chars: eval_padded_char_batch,
        char_embedding_model.batch_size: batch_size,
        char_embedding_model.max_seq_len: batch_seq_len,
        char_embedding_model.token_lengths: eval_tok_len_batch,
        char_embedding_model.max_tok_len: max_char_len
    }

    basic_feeds = {
        model.input_x1: eval_token_batch,
        model.input_x2: eval_shape_batch,
        model.input_y: np.zeros(eval_token_batch.shape),
        model.input_mask: eval_mask_batch,
        model.max_seq_len: batch_seq_len,
        model.batch_size: batch_size,
        model.sequence_lengths: eval_seq_len_batch
    }

    basic_feeds.update(char_embedding_feeds)
    total_feeds = basic_feeds.copy()

    if FLAGS.viterbi:
        preds, transition_params = sess.run([model.predictions, model.transition_params], feed_dict=total_feeds)

        viterbi_repad = np.empty((batch_size, batch_seq_len))
        for batch_idx, (unary_scores, sequence_lens) in enumerate(zip(preds, eval_seq_len_batch)):
            viterbi_sequence, _ = tf.contrib.crf.viterbi_decode(unary_scores, transition_params)
            viterbi_repad[batch_idx] = viterbi_sequence
        predictions.append(viterbi_repad)
    else:
        preds, scores = sess.run([model.predictions, model.unflat_scores], feed_dict=total_feeds)
        predictions.append(preds)
    return predictions

if FLAGS.predict_only:
    with open(FLAGS.sample_text_file_name) as f:
        sentences = f.read().strip().split("\n")
    eval_token_batch, eval_shape_batch, eval_char_batch, eval_seq_len_batch, eval_tok_len_batch, eval_mask_batch = batch_sentence_preprocess(sentences, vocab_str_id_map, shape_str_id_map, char_str_id_map, vocab_id_str_map, shape_id_str_map, char_id_str_map)
    eval_batches = (eval_token_batch, eval_shape_batch, eval_char_batch, eval_seq_len_batch, eval_tok_len_batch, eval_mask_batch)
    predictions = run_prediction(eval_batches)[0]
    for batch in range(len(eval_token_batch)):
        eval_token = eval_token_batch[batch].tolist()
        for x in zip([vocab_id_str_map[each] for each in eval_token], [labels_id_str_map[each] for each in predictions[batch].tolist()]):
            print(x)
        print("*******")

@Alaska47
Author

Never mind; it looks like I mixed up the order of eval_seq_len_batch and eval_tok_len_batch in this line:

return eval_token_batch, eval_shape_batch, eval_char_batch, eval_tok_len_batch, eval_seq_len_batch, mask_batch

Changing it to

return eval_token_batch, eval_shape_batch, eval_char_batch, eval_seq_len_batch, eval_tok_len_batch, mask_batch

fixed the issue I was having :)

I also got the code working to predict labels on a real-life example. I can put my changes into a PR if you'd like (I'll try to clean it up a bit as well).
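For what it's worth, returning the batch arrays in a named structure instead of a bare tuple would have made this mix-up impossible. A minimal sketch (EvalBatch is a hypothetical name, not part of the repo):

from collections import namedtuple

import numpy as np

EvalBatch = namedtuple(
    "EvalBatch", ["tokens", "shapes", "chars", "seq_lens", "tok_lens", "mask"])

# fields are accessed by name, so seq_lens and tok_lens can't be
# silently transposed at the call site
batch = EvalBatch(tokens=np.zeros((1, 5)), shapes=np.zeros((1, 5)),
                  chars=np.zeros((1, 25)), seq_lens=np.array([[3]]),
                  tok_lens=np.ones((1, 5), dtype=np.int64),
                  mask=np.ones((1, 5)))
print(batch.seq_lens)  # [[3]]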

@strubell
Member

strubell commented Jun 26, 2019 via email
