Encoding: Base64 vs. ASCII or UTF-8
Text Encoding (ASCII and UTF-8)
Text encoding schemes like ASCII and UTF-8 define how characters (text) are represented as binary data (bits):
- ASCII: Defines 128 characters using 7 bits each; in practice each character is stored in one 8-bit byte with the high bit set to 0.
  - Example: ‘A’ -> 01000001 (65 in decimal)
- UTF-8: Uses 1 to 4 bytes per character and can represent every Unicode character.
  - Example: ‘A’ -> 01000001 (1 byte), ‘€’ -> 11100010 10000010 10101100 (3 bytes)
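The bit patterns above can be checked directly in Python. This is a minimal sketch; the helper name `to_bits` is made up for illustration.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as an 8-bit binary string, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

ascii_a = "A".encode("ascii")     # 1 byte
utf8_a = "A".encode("utf-8")      # the same single byte as in ASCII
utf8_euro = "€".encode("utf-8")   # 3 bytes

print(to_bits(ascii_a))    # 01000001
print(ascii_a == utf8_a)   # True
print(to_bits(utf8_euro))  # 11100010 10000010 10101100
```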
Examples:
text = "Hello, world!"
with open('hello.txt', 'w', encoding='utf-8') as f:
    f.write(text)
with open('hello.txt', 'r', encoding='utf-8') as f:
    content = f.read()
print(content)  # Output: Hello, world!
When you transmit text over the internet (e.g., via HTTP), the text is sent as binary data. The text is encoded in a specific encoding (like UTF-8) before transmission. The recipient decodes the binary data back into text using the same encoding.
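The encode/decode round trip described above can be sketched without a network. The variable names are illustrative, and the Latin-1 decode is included only to show what happens when sender and receiver disagree on the encoding.

```python
message = "Price: 5 €"

# Sender: text -> bytes
wire_bytes = message.encode("utf-8")

# A receiver using the same encoding recovers the original text
received = wire_bytes.decode("utf-8")

# A receiver using a different encoding gets different (garbled) text
garbled = wire_bytes.decode("latin-1")

print(received == message)  # True
print(garbled == message)   # False
```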
What is Base64 Encoding?
- Base64: A method of encoding binary data as text using an alphabet of 64 printable ASCII characters. It is used to encode binary data (e.g., images, files) into text that can be safely carried in text-based contexts such as email (SMTP) or HTTP request and response bodies.
- Character Set: The 64 characters used in Base64 are:
- A-Z (uppercase letters)
- a-z (lowercase letters)
- 0-9 (digits)
- + and / (two additional symbols)
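As a quick check of the character set, the sketch below builds the 64-character alphabet (with `+` and `/` as the two additional symbols) and confirms that Python's `base64` module maps the 6-bit values 0 through 63 to exactly those characters, in that order. The variable names are illustrative.

```python
import base64
import string

# The standard Base64 alphabet, in index order 0..63
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Pack the 6-bit values 0, 1, ..., 63 into one integer (64 * 6 = 384 bits = 48 bytes)
packed = 0
for value in range(64):
    packed = (packed << 6) | value
data = packed.to_bytes(48, "big")

encoded = base64.b64encode(data).decode("ascii")
print(len(alphabet))        # 64
print(encoded == alphabet)  # True
```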
Purpose of Base64: Base64 encoding converts binary data into a string of ASCII characters. This is useful for embedding binary data in text-based formats like JSON, XML, or HTTP requests and responses.
Base64 encoding itself does not inherently use ASCII or UTF-8; instead, it produces a string of characters that fall within the ASCII character set. Let’s break this down:
Base64 and Character Encodings (ASCII/UTF-8)
- ASCII: The output of Base64 encoding is a string that uses only ASCII characters. This means that any Base64-encoded string is also valid ASCII text.
- UTF-8: UTF-8 is a superset of ASCII. Any ASCII string is also a valid UTF-8 string. Therefore, Base64-encoded strings can be safely represented as UTF-8.
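A small sketch of this compatibility: Base64-encode every possible byte value and check that the result is pure ASCII and decodes identically as ASCII or UTF-8. It assumes Python 3.7 or later for `bytes.isascii()`.

```python
import base64

# Encode all 256 possible byte values
encoded = base64.b64encode(bytes(range(256)))

print(encoded.isascii())                                    # True
print(encoded.decode("ascii") == encoded.decode("utf-8"))   # True
```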
How Base64 Works
- Encoding Process:
  - Binary data is grouped into 24-bit chunks (3 bytes).
  - Each 24-bit chunk is split into four 6-bit groups.
  - Each 6-bit group is mapped to a corresponding character in the Base64 alphabet.
  - If the final chunk has fewer than 3 bytes, it is padded with zero bits, and = characters are appended so the output length is a multiple of 4.
- Output:
  - The output is a string of ASCII characters that represents the binary data.
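The steps above can be sketched as a small hand-written encoder. This is an illustration only, not a replacement for `base64.b64encode`; the names `ALPHABET` and `b64encode_manual` are made up here. It also handles the standard `=` padding that applies when the input length is not a multiple of 3 bytes.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode_manual(data: bytes) -> str:
    """Illustrative Base64 encoder following the 3-bytes-to-4-characters scheme."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full chunk
        # Build a 24-bit integer, filling any missing bytes with zeros
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Split into four 6-bit groups, most significant first
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Groups made up entirely of fill bits are written as '='
        if missing:
            chars[-missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)

print(b64encode_manual(b"Hello"))  # SGVsbG8=
print(b64encode_manual(b"Hello") == base64.b64encode(b"Hello").decode("ascii"))  # True
```

Each 3 input bytes become 4 output characters, so Base64 text is roughly one third larger than the data it represents.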
Example of Base64 Encoding
- Binary Data: Let’s say we have binary data representing the text “Hello”.
  - ‘H’ -> 01001000
  - ‘e’ -> 01100101
  - ‘l’ -> 01101100
  - ‘l’ -> 01101100
  - ‘o’ -> 01101111
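- Grouping: The 40 bits are split into 6-bit groups. The last group has only 4 bits (1111), so it is padded with two zero bits.
  010010 000110 010101 101100 011011 000110 111100
- Mapping to Base64 Characters:
  010010 -> S
  000110 -> G
  010101 -> V
  101100 -> s
  011011 -> b
  000110 -> G
  111100 -> 8
- Result: Seven characters is not a multiple of four, so one = is appended as padding. The Base64-encoded string is “SGVsbG8=”.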
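The 6-bit grouping for “Hello” can be reproduced in a few lines. This is a sketch for checking the worked example; the variable names are illustrative.

```python
import base64

data = "Hello".encode("utf-8")
bits = "".join(f"{byte:08b}" for byte in data)  # 5 bytes -> 40 bits

# Pad with zero bits up to a multiple of 6
bits += "0" * (-len(bits) % 6)
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]

print(groups)
# ['010010', '000110', '010101', '101100', '011011', '000110', '111100']
print(base64.b64encode(data).decode("ascii"))  # SGVsbG8=
```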
Transmission of Base64
Base64 Encoding: Convert binary data to a Base64 string.
- Example: A binary image file is converted to a Base64 string.
import base64
# Binary data (example: part of a JPEG file)
binary_data = b'\xff\xd8\xff\xe0\x00\x10JFIF...'
# Encode binary data as Base64 (returns bytes)
base64_encoded = base64.b64encode(binary_data)
# Convert the Base64 bytes to a str
base64_string = base64_encoded.decode('ascii')
print(base64_string)  # For a complete JPEG file, the output begins like: /9j/4AAQSkZJRgABAQEASABIAAD/...
HTTP Transmission: When transmitting over HTTP, the Base64 string is included in the HTTP request or response body.
- Example: JSON payload in an HTTP request
{
    "image_data": "/9j/4AAQSkZJRgABAQEASABIAAD/..."
}
Conversion to Binary: Before the data leaves your computer, it is converted to binary form.
- Text data (including Base64 strings) is encoded as bytes.
import requests
json_payload = {
    "image_data": base64_string
}
response = requests.post('http://example.com/upload', json=json_payload)
print(response.status_code)
Binary Transmission: The HTTP client serializes the text (here, the JSON body containing the Base64 string) to bytes, typically as UTF-8. These bytes are then sent over the network.
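What "text to bytes" means for the request body can be shown without a network call. This is a sketch of the idea under the assumption that the body is UTF-8 encoded JSON; it does not describe the internals of any particular HTTP library, and the sample Base64 value is a stand-in.

```python
import json

base64_string = "SGVsbG8="  # stand-in for a real Base64 payload

body_text = json.dumps({"image_data": base64_string})
body_bytes = body_text.encode("utf-8")  # the bytes that would go on the wire

print(body_text)                          # {"image_data": "SGVsbG8="}
print(len(body_bytes) == len(body_text))  # True: ASCII-only text is one byte per character
```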
Reception and Decoding
- Binary Data Reception: The receiver gets the binary data transmitted over the network.
- Text Decoding: The binary data is decoded back to text (the original Base64 string).
- Base64 Decoding: The Base64 string is decoded back to the original binary data.
import base64
# Simulate receiving the Base64 string from an HTTP response
received_base64_string = response.json()['image_data']
# Decode Base64 string back to binary data
received_binary_data = base64.b64decode(received_base64_string)
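On the receiving side, it can be worth rejecting input that is not valid Base64. A minimal sketch using the `validate=True` option of `base64.b64decode`; the sample bytes and the malformed string are made up for illustration.

```python
import base64
import binascii

original = b"\xff\xd8\xff\xe0\x00\x10JFIF"  # sample bytes resembling the start of a JPEG
encoded = base64.b64encode(original).decode("ascii")

# Strict decoding recovers the exact original bytes
decoded = base64.b64decode(encoded, validate=True)
print(decoded == original)  # True

# With validate=True, characters outside the Base64 alphabet raise an error
try:
    base64.b64decode("SGVs*bG8=", validate=True)
    rejected = False
except binascii.Error:
    rejected = True
print(rejected)  # True
```

Without `validate=True`, `b64decode` silently discards characters outside the alphabet before decoding.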
Practical Example in Python
Encoding Binary Data to Base64
import base64
# Original text
text = "Hello"
# Convert text to bytes using UTF-8
utf8_bytes = text.encode('utf-8')
# Encode bytes to Base64
base64_encoded = base64.b64encode(utf8_bytes)
base64_string = base64_encoded.decode('ascii')  # Base64 string using ASCII characters
print(base64_string)  # Output: SGVsbG8=
Decoding Base64 to Binary Data
# Decode Base64 string to bytes
decoded_bytes = base64.b64decode(base64_string)
# Convert bytes back to text using UTF-8
decoded_text = decoded_bytes.decode('utf-8')
print(decoded_text)  # Output: Hello
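The same round trip works for non-ASCII text, because Base64 operates on the UTF-8 bytes rather than on the characters. A short sketch using the ‘€’ character from earlier:

```python
import base64

text = "€"
utf8_bytes = text.encode("utf-8")  # 3 bytes: e2 82 ac

encoded = base64.b64encode(utf8_bytes).decode("ascii")
print(encoded)  # 4oKs

decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == text)  # True
```

The three UTF-8 bytes map to exactly four Base64 characters, so no padding is needed here.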
Summary
- Base64: Encodes binary data as a string drawn from an alphabet of 64 ASCII characters, with = used for padding.
- Character Set: The Base64 alphabet consists of ASCII characters.
- UTF-8 Compatibility: Since ASCII is a subset of UTF-8, Base64-encoded strings are also valid UTF-8 strings.
- Encoding and Decoding: Base64 is used to convert binary data into a text format for safe transmission and can be decoded back to binary data.
In practice, when you Base64 encode data in Python or another language, the resulting string can be safely handled as ASCII or UTF-8 text, ensuring compatibility across various text-based protocols and systems.