Image inputs are metered and charged in tokens. The token cost of a given image is determined by its size and level of detail.
Each provider calculates the token cost of image analysis differently. For example, OpenAI charges a flat 85 tokens for each low-detail image. High-detail images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. They are then scaled so that the shortest side of the image is 768px long. Finally, OpenAI counts how many 512px squares the image consists of. Each 512px square costs 170 tokens, and another 85 tokens are always added to the final total. See the OpenAI Vision pricing calculator for more details.
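The OpenAI calculation described above can be sketched in a few lines of Python. This is only an illustrative estimate, not an official implementation: the function name `estimate_image_tokens` and its parameters are hypothetical, and the sketch assumes that images are only ever scaled down (never enlarged) and that dimensions are rounded to whole pixels after each resize. The actual billed amount may differ by model.

```python
import math

LOW_DETAIL_TOKENS = 85   # flat cost of a low-detail image
TILE_TOKENS = 170        # cost of each 512px square
BASE_TOKENS = 85         # always added to a high-detail total


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the input token cost of one image (hypothetical helper)."""
    if detail == "low":
        return LOW_DETAIL_TOKENS

    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale so the shortest side is 768px.
    # Assumption: smaller images are left as-is rather than enlarged.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512px squares needed to cover the image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * TILE_TOKENS + BASE_TOKENS


print(estimate_image_tokens(1024, 1024))         # 765  (4 squares)
print(estimate_image_tokens(2048, 4096))         # 1105 (6 squares)
print(estimate_image_tokens(4000, 3000, "low"))  # 85
```

For a 1024 x 1024 image, for instance, the first step changes nothing, the second step reduces it to 768 x 768, and four 512px squares are needed to cover it, giving 4 x 170 + 85 = 765 tokens.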
Response tokens for vision models are metered the same way as for text-based models. The number of tokens used when analyzing images varies depending on the model you are using.