Universal Sentence Encoder Embedding Experiments: E2 vs. C2 Performance

Following our experiments this week with sentence and document-level embeddings, we noted the phenomenal speed of the Universal Sentence Encoder V4 when running in a CPU-only environment. In our original experiments we struggled to push the model past 180% CPU. Through trial and error, we've discovered that around 250KB of combined input text in a single batch offers the highest throughput per thread. The number of input records does not appear to matter in our use case: 250KB worth of fulltext articles and 250KB split across a vastly larger number of individual sentences yield the same throughput. Note that this is unique to the DAN-based USE model and is NOT the case for the Transformer architectures, whose performance drops precipitously as the input text lengthens.
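As a rough illustration of this batching strategy, the minimal Python sketch below groups records into batches capped at roughly 250KB of combined text and submits each batch to the TensorFlow Serving REST endpoint exposed by the startup command at the end of this post. The batch_texts and embed_batch helpers, the endpoint URL, and the use of the requests library are our own illustrative assumptions, not part of the original workflow.

import json
import requests

# Assumed endpoint: matches the REST port (8501) and model name used in the
# TensorFlow Serving startup command shown later in this post.
PREDICT_URL = "http://localhost:8501/v1/models/universal-sentence-encoder:predict"

# Cap each batch at roughly 250KB of combined input text; in our tests only the
# total amount of text mattered, not the number of records per batch.
MAX_BATCH_BYTES = 250 * 1024

def batch_texts(texts, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of texts whose combined UTF-8 size stays under max_bytes."""
    batch, size = [], 0
    for text in texts:
        text_size = len(text.encode("utf-8"))
        if batch and size + text_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += text_size
    if batch:
        yield batch

def embed_batch(texts):
    """Submit one batch to the TensorFlow Serving REST API and return the embeddings."""
    response = requests.post(PREDICT_URL, json={"instances": texts})
    response.raise_for_status()
    return response.json()["predictions"]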

Using larger batch sizes with the base USE model yields no performance speedup in our use case, though batch sizes above 250KB produce unstable memory consumption patterns, with memory usage varying by many gigabytes over the course of a run. With a batch size of 1MB and 4 submission workers, the memory consumption of TensorFlow Serving oscillated continuously between 1.7GB and 6GB, jumping and dropping second by second. This did not appear to negatively impact throughput, but increasing the batch size past 250KB yields no speed improvement, only larger and larger swings in memory consumption.

On our quad-core system we found that the best throughput is achieved by matching the number of input threads to the number of cores. In our case, this means running four processes, each connecting to the same TensorFlow Serving instance and submitting 250KB batches of text serially: submit a 250KB batch of text, block until the server finishes sending its response, then submit the next 250KB batch, and so on.
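The sketch below shows one way this submission pattern could look in Python, reusing the hypothetical batch_texts() and embed_batch() helpers from the earlier example. The load_texts() and save_embeddings() functions are placeholders for however the input corpus is sharded and the results are stored, and are not part of the original setup.

import multiprocessing

NUM_WORKERS = 4  # matched to the number of cores on our quad-core VM

def submission_worker(worker_id):
    # Each worker connects to the same TensorFlow Serving instance and submits
    # its ~250KB batches serially: send a batch, block on the response, send the next.
    texts = load_texts(worker_id)  # hypothetical: this worker's share of the input
    for batch in batch_texts(texts):
        embeddings = embed_batch(batch)
        save_embeddings(worker_id, embeddings)  # hypothetical output sink

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=submission_worker, args=(i,))
               for i in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()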

Given the base USE model's considerable CPU-only performance, we wanted to test how much of an improvement we would see when moving from our original E2 family system (cost-optimized) to a C2 family system (maximal CPU performance for compute-intensive workloads). In particular, C2 family systems support AVX512 instructions, which the particular binary packaged with TensorFlow Serving is optimized for: "[external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA"

Given that the base USE model is not particularly memory-intensive, we only need 16GB of RAM to run it. Thus, we switched from an "e2-highmem-4" to a "c2-standard-4" VM.

The move from E2 to C2 yielded almost exactly a 2x speedup, doubling the speed of all inferencing tasks with the base USE model, including both fulltext and sentence-level inference.

Optimal batch size appears to be the same between the two families, with 250KB batches still yielding the highest throughput.

Other options tested included "tensorflow_session_parallelism," "tensorflow_intra_op_parallelism," and "tensorflow_inter_op_parallelism," but a range of different settings for these parameters had no impact on inference performance. We also tested enabling batching ("enable_batching") with a range of values for "max_batch_size," "batch_timeout_micros," "max_enqueued_batches," and "num_batch_threads," but adjusting these parameters only reduced inference performance rather than accelerating it.

The only parameter that had a strong measurable impact related to the Docker container itself: removing network proxying led to a 20% speedup in many of our use cases. This is set with "--net=host", yielding the following startup command:

docker run -t --net=host --restart always --name tf-serve-universal-sentence-encoder -v "/TENSORFLOW/models:/models" -e MODEL_NAME="universal-sentence-encoder" tensorflow/serving --rest_api_port=8501 --enable_model_warmup=true&

For AI workloads in the cloud, the 2x speedup from simply switching the processor family of an otherwise comparable VM is a reminder of how compute-intensive these workloads are, and that CPU-only AI workloads can gain substantial advantages simply by moving to a C2 family system.