WO2024206507A1

WO2024206507A1 - Region-aware pre-training for computer vision tasks

Info

Publication number: WO2024206507A1
Application number: PCT/US2024/021771
Authority: WO
Inventors: Dahun KIM; Anelia Angelova; Weicheng KUO
Original assignee: Google LLC
Priority date: 2023-03-27
Filing date: 2024-03-27
Publication date: 2024-10-03

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an image encoder neural network. In particular, a system performs region-aware training of the image encoder neural network for a computer vision task by implementing cropped positional embeddings during training, e.g., during pre-training of the image encoder neural network.

Description

Attorney Docket No.
REGION-AWARE PRE-TRAINING FOR COMPUTER VISION TASKS
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to U.S. Provisional Application No. 63/454,938, filed on March 27, 2023. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.
BACKGROUND
This specification relates to processing images using neural networks. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network, e.g., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of weights.
SUMMARY
This specification describes a system implemented as computer programs on one or more computers that trains an image encoder neural network. The image encoder neural network is a neural network that receives a respective embedding of each patch in an input image and processes the respective embeddings to generate an image embedding of the input image in an embedding space. As a particular example, the image encoder can generate a respective output embedding for each of the patches of the image and then combine the respective output embeddings to generate the image embedding. In particular, the system trains the image encoder neural network using a region-aware training scheme.
Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.
Contrastive learning techniques can be used to learn representations, e.g., of images, text, or both, that yield significant improvements when the representations are used for downstream tasks, e.g., image classification, image captioning, text-to-image search, and so on.
However, contrastive learning techniques generally optimize the image encoder neural network for image-level tasks, i.e., tasks that require generating a single prediction for the entire image, rather than region-level tasks. In particular, region-level tasks require operating on images having larger resolutions to allow the neural network to make precise region-level predictions, i.e., make respective (and potentially different) predictions for different regions within a given input image.
As a result, there is a mismatch between the way the positional embeddings are used in existing contrastive pretraining approaches and the way the embeddings are used when fine-tuning for a region-level task, e.g., open-vocabulary detection. The pretraining approaches typically apply full-image positional embeddings during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, for open-vocabulary detection fine-tuning, the recognition occurs at the region level, which requires the full-image positional embeddings to generalize to regions that are never seen during the pretraining due to the different resolutions of the training images used for the pre-training and downstream training tasks. This results in sub-optimal performance on the downstream task.
To address this issue, this specification describes a technique for using cropped positional embeddings that, during pre-training, cause the model to view the input image as a “crop” from a larger image.
This results in improved performance on the downstream task because the positional embeddings better match the downstream use case where recognition occurs at region- rather than image-level and where input images can have a higher resolution. An improved encoding of an image can therefore be provided.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an example neural network system.
FIG. 2 is a flow diagram of an example process for training the image encoder neural network.
FIG. 3 is a flow diagram of an example process for determining positional embeddings for patches of an input image during the training of the image encoder neural network.
FIG. 4 shows an example of the pre-training and downstream training of the image encoder neural network.
FIG. 5 shows an example of the performance of a downstream neural network trained using the described techniques.
Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
FIG. 1 shows an example neural network system 100. The neural network system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
The neural network system 100 is a system that trains an image encoder neural network 110 on training data 120 that includes a set of input images 122.
The image encoder neural network 110 is a neural network that receives an input image 122 and processes the input image 122 to generate an image embedding 112 of the input image 122 in an embedding space.
As a particular example, the image encoder neural network 110 can generate a respective embedding (“output embedding”) for each of multiple patches of the image and then combine the respective output embeddings to generate the image embedding 112. For example, the system 100 can pool the respective output embeddings, e.g., through global average pooling (GAP), to generate the image embedding 112.
An “embedding” as used in this specification is a vector of numeric values, e.g., floating point values or other values, having a pre-determined dimensionality. The space of possible vectors having the pre-determined dimensionality is referred to as the “embedding space.”
A “patch” of an image is a region of an image, e.g., so that the image is divided into non-overlapping regions called “patches.”
More specifically, when an input image 122 is received, the system 100 (or, equivalently, the image encoder neural network 110) pre-processes the input image 122 by dividing the image 122 into a plurality of patches and generating a respective patch embedding 124 of each patch of the image 122 from the intensity values of the pixels in the patch. Generating the patch embeddings is described in more detail below.
The system 100 (or, equivalently, the image encoder neural network 110) also generates a respective positional embedding 126 for each patch of the image 122. Generally, the positional embeddings are based on the position of the patch within the image 122 and not on the intensity values of the pixels in the patch.
The system 100 (or, equivalently, the image encoder neural network 110) then generates a respective combined embedding 128 of each patch by combining the respective patch embedding 124 and the respective positional embedding 126 of the patch.
The image encoder neural network 110 then processes the combined embeddings 128 to generate the image embedding 112 of the image 122.
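
The data flow described above (patches, patch embeddings 124, positional embeddings 126, combined embeddings 128, pooled image embedding 112) can be illustrated with a minimal NumPy sketch. The image size, patch size, embedding dimensionality, the linear projection used for the patch embedding, and the function names are all assumptions chosen for illustration, and the encoder itself is replaced by an identity stand-in; this is not the specification's implementation.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (H // patch, W // patch, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh, gw, patch * patch * c)

def combined_embeddings(image, patch, w_proj, pos_emb):
    """Patch embedding (linear projection of pixel values) plus positional embedding."""
    patch_emb = patchify(image, patch) @ w_proj  # (gh, gw, D)
    return patch_emb + pos_emb                   # summed per patch

def global_average_pool(output_embeddings):
    """Combine per-patch output embeddings into one image embedding."""
    return output_embeddings.mean(axis=(0, 1))

# Hypothetical sizes: a 224 x 224 RGB image, 16 x 16 patches, 32-dim embeddings.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch, dim = 16, 32
w_proj = rng.normal(size=(patch * patch * 3, dim)) * 0.02
pos_emb = rng.normal(size=(14, 14, dim)) * 0.02

combined = combined_embeddings(image, patch, w_proj, pos_emb)
# Identity stand-in for the encoder: pool the combined embeddings directly.
pooled = global_average_pool(combined)
```

In a real encoder the combined embeddings would pass through, e.g., self-attention blocks before pooling; only the bookkeeping around the encoder is shown here.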
Thus, the image encoder neural network 110 processes a combined embedding 128 for each patch, with the combined embedding 128 for a given patch being dependent on (i) the intensity values of the pixels within the patch and (ii) the location of the patch within the image 122. Therefore, the positional embeddings 126 provide the image encoder neural network 110 with information about what location each combined embedding 128 corresponds to within the image 122.
In some implementations, the system 100 trains the image encoder neural network 110 and a text encoder neural network 130 through contrastive learning.
The text encoder neural network 130 is a neural network that receives an input text segment 132 and processes the input text segment 132 to generate a text embedding 134 of the input segment 132 in the same embedding space as the image embedding 112, i.e., the image embeddings and text embeddings have the same dimensionality.
In these implementations, the training data 120 also includes, for each input image 122, a corresponding text segment 132, i.e., the training data 120 can include multiple training pairs that each include an input image 122 and a text segment 132.
In particular, the input text segment 132 in a given pair has been determined by the system 100 or an external source to describe the contents of the image 122 in the given pair or otherwise be relevant to the image 122 in the given pair. In other words, the image 122 and the input text segment 132 in each pair have been determined to be semantically similar to one another.
For example, within a given training pair, the text segment 132 can be a text annotation of the image 122 from a set of manually or automatically generated image annotations or can be alt text associated with the image 122 in a set of alt-text data. Alt text is text that is displayed in place of an image on a web page, e.g., if the image cannot be rendered properly or otherwise fails to load.
For example, the system 100 can obtain the alt-text data from data maintained by an Internet search engine or other software that automatically crawls web pages on the Internet.
After being trained, the image encoder 110 and the text encoder 130 can be used for one or more downstream tasks.
As a particular example, the downstream task can be a region-level task that requires making respective predictions for each of multiple regions of the input image 122.
For example, the downstream task can be object detection, e.g., open-vocabulary object detection. For example, the system 100 can fine-tune, on training data for the open-vocabulary object detection task, a downstream neural network that includes the text encoder neural network, the image encoder neural network, and a detector neural network head that operates on the outputs of the image encoder neural network 110. The system 100 can then use the fine-tuned neural network to perform object detection. One example of performing open-vocabulary object detection will be described below with reference to FIG. 4.
As another example, the downstream task can be semantic segmentation. In semantic segmentation, the input is an image or a set of multiple images and the output assigns each of a plurality of pixels in the input image(s) to a respective object category from a set of object categories.
As yet another example, the downstream task can be instance segmentation. In instance segmentation, the input is an image or a set of multiple images and the output assigns each of a plurality of pixels in the input image(s) to a respective object instance, with two pixels that are assigned the same instance depicting the same object instance and two pixels that are assigned different instances depicting different object instances.
As yet another example, the downstream task can be panoptic segmentation. In panoptic segmentation, the input is an image or a set of multiple images and the output
assigns each of a plurality of pixels in the input image(s) to a respective object instance and to a respective object category.
In many cases, e.g., because the downstream task is a region-level task or for other reasons, performing the downstream task can require the system 100 to operate on images that have larger resolutions than the images in the training data 120 that is used to perform the pre-training. Moreover, in the case of region-level tasks, performing the downstream task can require the system 100 to make respective predictions for each of multiple regions within these higher-resolution images.
Contrastive learning techniques can be used to learn representations, e.g., of images, text, or both, that yield significant improvements when the representations are used for downstream tasks, e.g., image classification, image captioning, text-to-image search, and so on.
However, contrastive learning techniques generally optimize the image encoder neural network 110 for image-level tasks, i.e., tasks that require generating a single prediction for the entire image, rather than region-level tasks. In particular, as described above, region-level tasks require operating on images having larger resolutions to allow the neural network to make precise region-level predictions. Because, given a patch size, images with different resolutions will be divided into different numbers of patches, the positional embeddings for region-level or other downstream tasks will need to provide information about a different set of image locations than those used during pre-training.
As a result, there is a mismatch between the way the positional embeddings are used in existing contrastive pretraining approaches and the way the embeddings are used when fine-tuning for a region-level task, e.g., open-vocabulary detection.
The pretraining approaches typically apply full-image positional embeddings during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, for open-vocabulary detection fine-tuning, the recognition occurs at the region level, which requires the full-image positional embeddings to generalize to regions that they never see during the pretraining. This results in sub-optimal performance on the downstream task.
To address this issue, the system 100 uses cropped positional embeddings 126 during the pre-training to cause the neural network 110 to view the input image 122 as a “crop” from a larger image. This results in improved performance on the downstream task because the positional embeddings better match the downstream use case where recognition occurs at region- rather than image-level.
Training the image encoder neural network 110 using cropped positional embeddings 126 is described in more detail below.
The image encoder neural network 110 can have any appropriate architecture that allows the neural network 110 to map the combined embeddings 128 to the image embedding 112.
For example, the image encoder neural network 110 can be a Vision Transformer (ViT) neural network that has one or more self-attention layer blocks that each include one or more self-attention layers. Each self-attention layer receives a respective input embedding for each of the patches and applies a self-attention mechanism over the respective input embeddings to update the input embeddings.
As another example, the image encoder neural network 110 can be a convolutional neural network.
As yet another example, the image encoder neural network 110 can be a neural network that has a mix of both convolutional and self-attention layers.
When used, the text encoder neural network 130 can have any appropriate architecture that allows the text encoder neural network 130 to map a text sequence to a text embedding.
In a particular example, the text encoder neural network 130 can have an attention-based architecture, e.g., the architecture of an encoder-only, encoder-decoder, or decoder-only Transformer neural network. In this example, the text encoder neural network 130 can include a sequence of layers that includes one or more self-attention layers, where each self-attention layer is configured to receive as input a respective current representation of each of the text tokens in the current text sequence and to process the respective current representations to generate as output a respective updated representation of each of the text tokens in the current text sequence by applying a self-attention mechanism over the respective current representations.
FIG. 2 is a flow diagram of an example process 200 for training the image encoder neural network on a training task. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural network system, e.g., the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 200.
The training task will also be referred to as a “first task.”
In particular, the process 200 describes a training iteration performed during the training of the image encoder on training examples (“training pairs”) that each include a training image having a first size, i.e., a first resolution.
The system maintains data specifying a respective positional embedding for each patch of a larger image having a second size (step 202). For example, the second size can be the size of the images that will be provided to the image encoder neural network when performing the downstream task or when training the image encoder neural network to perform the downstream task.
As one example, the data can include a respective different positional embedding for each of the patches.
As another example, the maintained data can include a respective positional embedding for each patch of a region having the first size (that is smaller than the second size). When needed, the system can then generate the respective positional embeddings for each of the patches of the larger image having the second size by interpolating the respective positional embeddings for the patches of the region having the first size to have the second size.
The system obtains a batch of training examples (step 204). Each training example includes a respective training image having a first size that is less than the second size. That is, as described above, when performing pre-training, the system typically trains using images that are smaller in size than the images that will be used to perform the downstream task.
The system then performs steps 206-214 for each training example in the batch.
The system divides the training image into a plurality of patches (step 206). For example, the system can partition the training image into a fixed number of patches all having the same, fixed size.
The system generates a respective patch embedding of each patch (step 208). For example, for each patch, the system can process the intensity values of the pixels in the patch using a patch embedding subnetwork to generate the respective patch embedding. The patch embedding subnetwork can be, e.g., a linear projection or multi-layer perceptron (MLP), and can be pre-trained or trained jointly with the image encoder neural network.
The system generates a respective positional embedding for each patch of the training image using the maintained data (step 210). These positional embeddings will also be referred to as “cropped” positional embeddings because they cause the image encoder neural network to view the training image as a “crop” from the larger image.
That is, although the image has the first, smaller size and the respective positional embeddings in the maintained data are for patches of a larger image having the second size, the system nonetheless uses the embeddings in the maintained data to generate the positional embeddings of the patches in the image. Generating the cropped positional embeddings will be described below with reference to FIG. 3.
The system generates a respective combined embedding of each patch by combining the respective patch embedding and the respective positional embedding of the patch (step 212). For example, the system can, for each patch, sum the respective patch embedding and the respective positional embedding of the patch.
The system processes the respective combined embeddings using the image encoder neural network to generate a training embedding of the training image (step 214). For example, as described above, the image encoder neural network can process the respective combined embeddings to generate a respective output embedding for each of the patches and then combine the output embeddings to generate the training embedding of the training image. When the image encoder neural network is a Vision Transformer neural network, the image encoder neural network can generate the respective output embeddings by processing the respective combined embeddings through a sequence of self-attention layer blocks.
The system trains the image encoder neural network using the training embeddings of the training images on a first loss function for the training task (step 216).
For example, as described above, the system can train the image encoder neural network through contrastive learning. That is, the training (“first”) task can be a task that includes a contrastive learning task and, optionally, one or more other tasks, e.g., a generative task, e.g., an image captioning task.
In this example, the system trains the image encoder jointly with a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space. In this example, each training example also includes a respective training text segment, and, in order to train the image encoder and the text encoder, the system processes the training text segment using the text encoder neural network to generate a training text embedding of the training text segment. The system then trains the image and text encoder neural networks on a loss that includes a contrastive loss function that is based on similarities between the training text embeddings and the training image embeddings.
A particular example of a contrastive loss will be described next. Based on the embeddings for the images and the text segments in the N pairs in the batch, an N x N similarity matrix A is computed, where A_{i,j} is a value that represents how similar the embedding of x_i, the image in the i-th pair, is to the embedding of y_j, the text segment in the j-th pair. For example, A_{i,j} can be the dot product between the embedding of x_i and the embedding of y_j.
The system can then train the text encoder neural network and the image encoder neural network using gradients of a contrastive loss computed using the matrix A. For example, the contrastive loss can be the cross-entropy loss on the rows and columns of A, where the diagonal entries are treated as correct classes while other entries are treated as incorrect classes. A specific example of such a loss is:
$$\mathcal{L} = -\frac{1}{N}\left(\sum_{i=1}^{N}\log\frac{\exp(A_{i,i}/\tau)}{\sum_{j=1}^{N}\exp(A_{i,j}/\tau)} + \sum_{j=1}^{N}\log\frac{\exp(A_{j,j}/\tau)}{\sum_{i=1}^{N}\exp(A_{i,j}/\tau)}\right),$$
where $\tau$ is the softmax temperature that scales the logits, e.g., which serves to steepen or dampen the softmax distributions in the rows and columns of A, and N is the total number of training pairs in the batch.
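
As a rough illustration of the cross-entropy loss on the rows and columns of A described above, the following NumPy sketch computes the similarity matrix from a batch of image and text embeddings and treats the diagonal entries as the correct classes. The function names, the temperature value of 0.07, the L2 normalization of the embeddings, and the division by N are assumptions made for the sketch rather than details fixed by the specification.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Cross-entropy over rows and columns of the N x N similarity matrix.

    image_emb, text_emb: arrays of shape (N, D); row i of each belongs to pair i.
    The temperature default is a hypothetical value, not taken from the source.
    """
    # Assumed optional step: L2-normalize the embeddings before computing A.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    a = image_emb @ text_emb.T      # A[i, j]: dot product of image i and text j
    logits = a / temperature
    n = a.shape[0]

    # Diagonal entries are the correct classes for both directions.
    rows = np.diag(log_softmax(logits, axis=1))  # each image against all texts
    cols = np.diag(log_softmax(logits, axis=0))  # each text against all images
    return -(rows.sum() + cols.sum()) / n

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = contrastive_loss(emb, emb)                       # aligned pairs
loss_shuffled = contrastive_loss(emb, np.roll(emb, 1, axis=0))  # misaligned pairs
```

With aligned pairs the diagonal dominates each row and column, so the loss is small; rolling the text embeddings by one position moves the true matches off the diagonal and the loss grows.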
In some cases, prior to computing the matrix A, the system normalizes the text and image embeddings of the images and text sequences in the batch.
As this loss is minimized, for all pairs in the batch, the embeddings of x_i and y_i become closer together while becoming farther from all other embeddings of all other visual inputs and text segments in the batch, thereby achieving the goal of the contrastive learning.
As another example, the system can instead use a focal contrastive loss. The focal contrastive loss includes a respective term for each pair of training examples that is based on a sigmoid function applied to the temperature-scaled dot product of the image embedding for one of the examples in the pair and the text embedding for the other example in the pair. The focal contrastive loss can be expressed as:
$$\mathcal{L}_{\text{focal}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}(1-p_i)^{\gamma}\log(p_i), \quad \text{where } p_i = \begin{cases}\sigma(v_i \cdot l_j/\tau) & \text{if } i = j \\ 1-\sigma(v_i \cdot l_j/\tau) & \text{if } i \neq j,\end{cases}$$
where $\sigma$ is the sigmoid function, $\gamma$ is the focusing parameter of the focal loss, $\tau$ is the temperature, $v_i$ is the image embedding for the image in the i-th pair in the batch, and $l_j$ is the text embedding for the text sequence in the j-th pair in the batch.
Alternatively, the focal contrastive loss can be expressed as the sum of a text to image focal loss and an image to text focal loss, as follows:
$$\mathcal{L}_{\text{focal}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}(1-p_i)^{\gamma}\log(p_i) + \mathcal{L}_{\text{T2I}},$$
where the first term is the image to text focal loss and $\mathcal{L}_{\text{T2I}}$ is the text to image focal loss, which takes the same form with the roles of the image and text embeddings exchanged. [...] are weighted [...] than what the softmax cross entropy loss can provide.
Optionally, the overall loss can include the contrastive loss and one or more other losses, e.g., an image captioning loss or other losses computed using the text encoder.
As another example, the training task may be a different task that does not use the text encoder neural network.
As one example, the training task may be a contrastive learning task that uses training examples that include an image and a different modality of data that is not text, e.g., audio or point cloud data. In this example, the system trains the image encoder neural network jointly with an encoder that generates embeddings of the other modality of data.
As another example, the training task may be an image-only training task that does not use another modality of data. Examples of such tasks include image classification tasks and image-only contrastive learning tasks.
When the system trains the patch embedding subnetwork jointly with the image encoder neural network, the system also trains the patch embedding subnetwork using gradients of the first loss function. For example, the system can backpropagate gradients through the image encoder neural network to determine a gradient of the first loss function with respect to the parameters of the patch embedding subnetwork and then update the parameters of the patch embedding subnetwork by applying an appropriate optimizer to the gradients.
Generally, in at least some training iterations, the system also updates the respective positional embeddings for the patches of the larger image having the second size using gradients of the first loss function.
For example, when the maintained data includes a respective positional embedding for each patch of a region having the first size, the system can update the respective positional embeddings for the patches of the larger image having the second size using gradients of the first loss function by updating the respective positional embeddings for the patches of the region having the first size using gradients of the first loss function. The system can then update the respective positional embeddings for each of the patches of the larger image having the second size by interpolating the respective updated positional embeddings for the patches of the region having the first size to have the second size.
By repeatedly performing the process 200 on different batches of training examples, the system can train the image encoder neural network to learn high quality representations of images.
After training the image encoder neural network on the first loss function for the first task, the system can train a task neural network that includes the image encoder neural network to perform a computer vision task, i.e., the downstream task referred to above.
For example, the computer vision task can require operating on images having the second size.
As a particular example, the computer vision task can be object detection, e.g., an open-vocabulary object detection task that also makes use of the text encoder neural network. One example of training the image encoder neural network on an open-vocabulary object detection task is described below with reference to FIG. 4.
FIG. 3 is a flow diagram of an example process 300 for determining positional embeddings during training of the image encoder neural network. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural network system, e.g., the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 300.
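
The interpolation of positional embeddings maintained for a first-size region up to the grid of the larger, second-size image can be sketched as a bilinear resize of a (height, width, dimension) array. The grid sizes (14 x 14 and 64 x 64), the corner-aligned sampling convention, and the function name are illustrative assumptions; the specification only requires some appropriate interpolation.

```python
import numpy as np

def resize_bilinear(grid, out_h, out_w):
    """Bilinearly resize an (h, w, d) grid of embeddings to (out_h, out_w, d).

    Uses a corner-aligned convention: the first and last rows and columns of
    the output coincide with those of the input.
    """
    in_h, in_w, _ = grid.shape
    ys = np.linspace(0.0, in_h - 1, out_h)   # source row coordinate per output row
    xs = np.linspace(0.0, in_w - 1, out_w)   # source column coordinate per output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]            # vertical interpolation weights
    wx = (xs - x0)[None, :, None]            # horizontal interpolation weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Hypothetical sizes: 14 x 14 patches at the first size, 64 x 64 at the second.
rng = np.random.default_rng(0)
base_pe = rng.normal(size=(14, 14, 32))      # maintained (learned) embeddings
full_pe = resize_bilinear(base_pe, 64, 64)   # embeddings for the larger image
```

Because the resize is a fixed linear map of `base_pe`, gradients computed with respect to `full_pe` can be propagated back to `base_pe`, which is one way to realize the update-then-re-interpolate scheme described above.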
As described above, the system maintains data specifying a respective positional embedding for each patch of a larger image having a second size and uses the maintained data to determine positional embeddings for patches of training images having a smaller, first size.
For example, the maintained data can include a respective positional embedding for each patch of a region having the first size (that is smaller than the second size). When needed, the system can then generate the respective positional embeddings for each of the patches of the larger image having the second size by interpolating the respective positional embeddings for the patches of the region having the first size to have the second size.
To determine the positional embeddings for a given training image, the system identifies a crop of the larger image (step 302).
Generally, while the larger image has the second size that is larger than the first size, the crop has the first size and contains a plurality of patches that each correspond to one of the plurality of patches in the larger image and to one of the plurality of patches of the training image. That is, because the crop has the same first size as the training image, the system divides the crop into the same number of patches having the same size as the patches of the training image, resulting in the crop containing a plurality of patches that each correspond to one of the plurality of patches in the larger image and to one of the plurality of patches of the training image, i.e., with each patch in the crop corresponding to the patch of the training image that is in the corresponding location within the training image.
In some implementations, the system uses the same crop for each training image in the batch.
In some other implementations, the system computes respective, potentially different crops for each training image.
Generally, to ensure that crops generated throughout training will provide good coverage of various regions of the larger image, the system generates each crop with some measure of randomness.
As one example, the system can generate an initial crop of the larger image (step 304) and then re-size the initial crop to have the first size (step 306).
For example, the system can randomly crop the larger image to generate the initial crop. As one example of this, the system can select a scale ratio for the initial crop, select an aspect ratio for the initial crop, and then randomly crop a region from the larger image that has the selected scale ratio and the selected aspect ratio.
For example, the system can select a scale ratio for the initial crop by sampling from a distribution, e.g., a uniform distribution, over a set of possible scale ratios.
As another example, the system can select an aspect ratio for the initial crop by sampling from a distribution, e.g., a uniform distribution, over a set of possible aspect ratios.
Thus, the crop covers a region of the larger image and the corresponding positional embeddings for each patch in the region as specified by the maintained data.
The system can then re-size the crop to have the first size by applying interpolation, e.g., bi-linear interpolation or another appropriate interpolation technique, to the crop (and, therefore, to the positional embeddings for each patch in the region covered by the crop) to yield a re-sized crop that has the first size and that has a respective positional embedding for each patch of a region that has the first size.
The system can then assign, to each patch in the training image, the respective positional embedding for the corresponding patch of the crop from the maintained data (step 308). That is, for each patch of the training image, the system assigns, as the positional embedding of the patch, the positional embedding of the corresponding patch in the re-sized crop.
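
Steps 302 to 308 can be sketched in NumPy as follows: sample a scale ratio and an aspect ratio, pick a random box of that shape inside the grid of full-size positional embeddings, and bilinearly resample the box to the patch grid of the training image. The sampling ranges (0.1 to 1.0 for scale, 0.5 to 2.0 for aspect ratio), the grid sizes, the patch-centre sampling convention, and the function names are assumptions made for illustration and are not specified in the text above.

```python
import numpy as np

def sample_crop_box(rng, grid_h, grid_w, scale_range=(0.1, 1.0), ratio_range=(0.5, 2.0)):
    """Sample a random box (y0, x0, h, w) in patch-grid coordinates (step 304)."""
    scale = rng.uniform(*scale_range)    # fraction of the full grid area
    ratio = rng.uniform(*ratio_range)    # width / height of the crop
    area = scale * grid_h * grid_w
    # Clamp to the grid; when clamped, the realized ratio deviates from the sample.
    h = min(np.sqrt(area / ratio), grid_h)
    w = min(np.sqrt(area * ratio), grid_w)
    y0 = rng.random() * (grid_h - h)     # random top-left corner inside the grid
    x0 = rng.random() * (grid_w - w)
    return y0, x0, h, w

def crop_and_resize(full_pe, box, out_h, out_w):
    """Bilinearly resample the boxed region of full_pe to (out_h, out_w, d) (step 306)."""
    y0, x0, h, w = box
    grid_h, grid_w, _ = full_pe.shape
    # Centres of the output patches, expressed as indices into the full grid.
    ys = np.clip(y0 + (np.arange(out_h) + 0.5) * h / out_h - 0.5, 0.0, grid_h - 1)
    xs = np.clip(x0 + (np.arange(out_w) + 0.5) * w / out_w - 0.5, 0.0, grid_w - 1)
    yl, xl = np.floor(ys).astype(int), np.floor(xs).astype(int)
    yh, xh = np.minimum(yl + 1, grid_h - 1), np.minimum(xl + 1, grid_w - 1)
    wy, wx = (ys - yl)[:, None, None], (xs - xl)[None, :, None]
    top = full_pe[yl][:, xl] * (1 - wx) + full_pe[yl][:, xh] * wx
    bottom = full_pe[yh][:, xl] * (1 - wx) + full_pe[yh][:, xh] * wx
    return top * (1 - wy) + bottom * wy

def cropped_positional_embeddings(rng, full_pe, out_h, out_w):
    """One cropped positional embedding per training-image patch (step 308)."""
    box = sample_crop_box(rng, full_pe.shape[0], full_pe.shape[1])
    return crop_and_resize(full_pe, box, out_h, out_w)

# Hypothetical sizes: a 64 x 64 full-size grid, a 14 x 14 training-image grid.
rng = np.random.default_rng(0)
full_pe = rng.normal(size=(64, 64, 32))
cpe = cropped_positional_embeddings(rng, full_pe, 14, 14)
```

Entry `cpe[r, c]` is then added to the patch embedding of the training-image patch at row `r`, column `c`. A crop much larger than the output grid is point-sampled by this simple resize; an area-averaging resize would be a reasonable alternative.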
FIG. 4 shows an example 400 of the operation of the system during pre-training. As shown in the example 400, the system receives a training image 402 and divides the training image into a plurality of patches. The system then processes each patch to generate a respective patch embedding of each of the patches of the training image 402. Additionally, the system generates a respective cropped positional embedding (CPE) 410 for each patch of the training image 402. In the example of FIG. 4, the system generates the CPEs 410 by generating an initial crop 404 of the larger image 406 and then resizing the initial crop 404 to have the first size, resulting in a crop 408 that has the first size and that contains a plurality of patches that each correspond to one of the plurality of patches in the larger image and to one of the plurality of patches of the training image. The system then generates the positional embeddings by assigning, to each patch in the training image, the respective positional embedding for the corresponding patch of the crop from the maintained data. The system then generates a combined embedding for each patch by combining, e.g., by summing as shown in FIG. 4, the patch embedding for the patch and the positional embedding for the patch. As shown in the example 400, the system processes the combined embeddings using the image encoder neural network 110 which, in the example of FIG. 4, is a Vision Transformer neural network, to generate a respective output embedding for each patch of the training image 402. The system then applies global average pooling (GAP) to the output embeddings to generate the training image embedding of the image 402. The system can then apply a contrastive loss on the training image embeddings and the text embeddings for the training examples in the batch to train the image encoder neural network and the text encoder neural network.
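The data flow of the example 400 can be illustrated, under stated assumptions, by the following Python sketch, which sums patch and positional embeddings, applies global average pooling to per-patch output embeddings, and evaluates a contrastive loss over a batch of image and text embeddings. The function names are hypothetical, nested lists stand in for tensors, and the image encoder itself is omitted. The particular loss shown, a symmetric softmax cross-entropy over cosine similarities with a temperature of 0.07, is only one common form of contrastive loss and is an assumption rather than the loss required by this specification.

```python
import math


def add_positional(patch_embs, pos_embs):
    """Sum patch and positional embeddings element-wise, one vector per patch."""
    return [[p + q for p, q in zip(pe, po)]
            for pe, po in zip(patch_embs, pos_embs)]


def global_average_pool(output_embs):
    """Average per-patch output embeddings into a single image embedding."""
    n = len(output_embs)
    return [sum(column) / n for column in zip(*output_embs)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric softmax loss; the i-th image and i-th text form the positive pair."""
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    text_to_image = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))
```

In a full training loop, the combined embeddings returned by `add_positional` would be processed by the image encoder neural network before pooling, and the loss would be differentiated with respect to the encoder parameters and the maintained positional embeddings.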
FIG. 4 also shows an example 450 of training a downstream neural network on an open-vocabulary object detection task. As shown in the example of FIG. 4, the system generates the downstream neural network by replacing the GAP operation with a set of downstream detector heads. The downstream detector heads can be, e.g., Mask R-CNN heads or other appropriate detector heads that can be used as part of an object detection neural network. During downstream training, the system processes combined embeddings of patches from the higher-resolution input images using the image encoder neural network to generate output embeddings, and the downstream detector heads process the output embeddings to generate region embeddings of feature region crops from the training image. Because the training images for downstream training have the second, higher resolution, the system can directly use the whole-image positional embeddings from the maintained data (e.g., generated by interpolating the embeddings of patches of images of the first size) for the downstream training, rather than needing to perform CPE. These region embeddings can then be used to train the downstream neural network, e.g., based on a loss that measures a similarity between the region embedding for each feature region crop and a ground truth embedding for the region crop. For example, the ground truth embedding can be a text embedding for an object depicted within the feature region crop that is generated by processing a label for the object using the text encoder neural network. After training, the system can perform open-vocabulary object detection using region embeddings (and, optionally, RoI-aligned embeddings of the regions generated by the image encoder neural network) and text embeddings of a desired set of object categories. For example, the system can generate, for each region, a respective similarity score for each object category using the text embedding of the object category and the region embedding for the region (and, optionally, the RoI-aligned embedding of the region) and then select the category having the highest similarity score.
FIG. 5 shows an example 500 of the performance of a downstream neural network trained using the described techniques on an open-vocabulary object detection task. In particular, FIG. 5 shows the performance of three variants of a downstream neural network trained using the described techniques (RO-ViT) as compared to variants of other downstream neural networks trained using other techniques. As can be seen from FIG. 5, the described techniques significantly outperform existing techniques, including those that are based on the same architecture (OWL-ViT) and those that are based on convolutional neural networks. This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions. Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
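Returning to the open-vocabulary object detection described above with reference to FIG. 4, the category-selection step can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, nested lists and a dictionary stand in for tensors, and cosine similarity is assumed as the similarity score, which this specification does not require. The optional combination with RoI-aligned embeddings is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_regions(region_embs, category_text_embs):
    """For each region, return (category, score) for the highest-scoring category."""
    results = []
    for region in region_embs:
        # One similarity score per object category in the desired vocabulary.
        scores = {name: cosine(region, emb)
                  for name, emb in category_text_embs.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results
```

Because the categories are supplied only as text embeddings at inference time, the set of categories passed to `classify_regions` can differ from the labels seen during downstream training.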
Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network. In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently. Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers. Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return. Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a Jax framework. Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device. While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. 
Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. What is claimed is:

Claims

1. A method performed by one or more computers and for training an image encoder neural network having image encoder neural network parameters and configured to process an image to generate an image embedding of the image in an embedding space, the method comprising: maintaining data specifying a respective positional embedding for each patch of a larger image having a second size; obtaining a batch of training examples, each training example comprising a respective training image having a first size that is less than the second size; for each training image: dividing the training image into a plurality of patches; generating a respective patch embedding of each patch; generating a respective positional embedding for each patch, comprising: identifying a crop of the larger image, wherein: the larger image has the second size that is larger than the first size, the crop has the first size, and the crop contains a plurality of patches that each correspond to one of the plurality of patches in the larger image and to one of the plurality of patches of the training image; and assigning, to each patch in the training image, the respective positional embedding for the corresponding patch of the crop from the maintained data; generating a respective combined embedding of each patch by combining the respective patch embedding and the respective positional embedding of the patch; and processing the respective combined embeddings using the image encoder neural network to generate a training embedding of the training image; and training the image encoder neural network using the training embeddings of the training image on a first loss function for the first task.
2. The method of claim 1, wherein: the image encoder neural network is trained jointly with a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space, each training example further comprises a respective training text segment, and the method further comprises: for each training text segment: processing the training text segment using the text encoder neural network to generate a training text embedding of the training text segment; and wherein the first loss function is a contrastive loss function that is based on similarities between the training text embeddings and the training image embeddings.
3. The method of claim 2, wherein the contrastive loss function is a focal loss function.
4. The method of any preceding claim, further comprising: after training the image encoder neural network on the first loss function for the first task: training a task neural network that comprises the image encoder neural network to perform a computer vision task.
5. The method of claim 4, wherein the computer vision task requires operating on images having the second size.
6. The method of claim 4 or claim 5, wherein the computer vision task is object detection.
7. The method of claim 6, wherein the computer vision task is open vocabulary object detection.
8. The method of any preceding claim, wherein identifying a crop of the larger image comprises: generating an initial crop of the larger image; and re-sizing the initial crop to have the first size.
9. The method of claim 8, wherein generating the initial crop comprises: randomly cropping the larger image.
10. The method of claim 8 or claim 9, wherein generating an initial crop of the larger image comprises: selecting a scale ratio for the initial crop; selecting an aspect ratio for the initial crop; and randomly cropping a region from the larger image that has the selected scale ratio and the selected aspect ratio.
11. The method of any preceding claim, wherein training the image encoder neural network using the training embeddings of the training image on a first loss function for the first task comprises: updating the respective positional embeddings for the patches of the larger image having the second size using gradients of the first loss function.
12. The method of any preceding claim, wherein generating a respective patch embedding of each patch comprises: processing the intensity values of the pixels in the patch using a patch embedding subnetwork to generate the respective patch embedding.
13. The method of claim 12, wherein training the image encoder neural network using the training embeddings of the training image on a first loss function for the first task comprises: training the patch embedding subnetwork using gradients of the first loss function.
14. The method of any preceding claim, wherein maintaining a respective positional embedding for each patch of the larger image comprises: maintaining a respective positional embedding for each patch of a region having the first size; and generating the respective positional embeddings for each of the patches of the larger image having the second size by interpolating the respective positional embeddings for the patches of the region having the first size to have the second size.
15. The method of claim 14, when dependent on claim 11, wherein updating the respective positional embeddings for the patches of the larger image having the second size using gradients of the first loss function comprises: updating the respective positional embeddings for the patches of the region having the first size using gradients of the first loss function; and updating the respective positional embeddings for each of the patches of the larger image having the second size by interpolating the respective updated positional embeddings for the patches of the region having the first size to have the second size.
16. The method of any preceding claim, wherein generating a respective combined embedding of each patch by combining the respective patch embedding and the respective positional embedding of the patch comprises: summing the respective patch embedding and the respective positional embedding of the patch.
17. The method of any preceding claim, wherein the image encoder neural network is a vision Transformer neural network.
18. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-17.
19. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-17.