Person Description Model
-Person Search With Natural Language Description-
DESCRIPTION
import numpy as np

ort_inputs = {
    # Dummy image tensor; the image branch is not used for a text query.
    "images": np.zeros([1, 3, 384, 128], dtype=np.float32),
    # Tokenized search query, shaped [1, 64].
    "txt": np.expand_dims(np.array(tokens["input_ids"], dtype=np.int64), axis=0),
    # All-ones mask so the model attends to the full query.
    "attention_mask": np.ones([1, 64], dtype=np.int64),
}
ort_outs = ort_session.run(None, ort_inputs)
text_emb = ort_outs[1]  # shape (1, 2048)
The example code above shows how to run a text query against the model. ort_session is an ONNX Runtime inference session; a sketch of creating it follows.
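A minimal sketch of creating the session, assuming the exported model is stored in a hypothetical file named person_description.onnx:

import onnxruntime as ort

# Hypothetical model path; point this at your exported ONNX file.
ort_session = ort.InferenceSession("person_description.onnx")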
During inference, the input image is not needed, because the video to be searched must be pre-indexed; images can therefore simply be an all-zeros array. The txt input is the tokenized search query, using the same tokenizer as the WangchanBERTa model (see the sketch after this paragraph). Since the model must be able to attend to the entire query text, attention_mask is all ones.
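A minimal tokenization sketch producing the tokens dictionary used above, assuming the Hugging Face checkpoint airesearch/wangchanberta-base-att-spm-uncased (substitute the tokenizer that matches your deployment) and a hypothetical example query:

from transformers import AutoTokenizer

# Assumed WangchanBERTa checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")

# Pad/truncate to 64 tokens to match the [1, 64] inputs above.
tokens = tokenizer(
    "ผู้ชายใส่เสื้อสีแดง",  # example query: "a man wearing a red shirt"
    padding="max_length",
    truncation=True,
    max_length=64,
)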
The output text_emb is the embedding of the description text, a NumPy array of shape (1, 2048). It can be compared against the pre-computed image embedding of each person in the video, for example with approximate nearest-neighbour search.
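A brute-force sketch of that comparison, assuming the per-person image embeddings were saved to a hypothetical person_embeddings.npy file; for large galleries, an approximate nearest-neighbour library such as FAISS can replace this exact search:

import numpy as np

# Hypothetical pre-indexed gallery: one 2048-d embedding per detected person.
person_embs = np.load("person_embeddings.npy")  # shape (num_persons, 2048)

# Cosine similarity between the query embedding and every person embedding.
q = text_emb[0] / np.linalg.norm(text_emb[0])
g = person_embs / np.linalg.norm(person_embs, axis=1, keepdims=True)
scores = g @ q  # shape (num_persons,)

top5 = np.argsort(-scores)[:5]  # indices of the five best-matching persons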