struct --- 将字节串解读为打包的二进制数据

源代码: Lib/struct.py


This module converts between Python values and C structs represented as Python bytes objects. Compact format strings describe the intended conversions to/from Python values. The module's functions and objects can be used for two largely distinct applications, data exchange with external sources (files or network connections), or data transfer between the Python application and the C layer.

备注

When no prefix character is given, native mode is the default. It packs or unpacks data based on the platform and compiler on which the Python interpreter was built. The result of packing a given C struct includes pad bytes which maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. In contrast, when communicating data between external sources, the programmer is responsible for defining byte ordering and padding between elements. See 字节顺序,大小和对齐方式 for details.

某些 struct 的函数(以及 Struct 的方法)接受一个 buffer 参数。 这将指向实现了 缓冲协议 并提供只读或是可读写缓冲的对象。 用于此目的的最常见类型为 bytesbytearray,但许多其他可被视为字节数组的类型也实现了缓冲协议,因此它们无需额外从 bytes 对象复制即可被读取或填充。

函数和异常

此模块定义了下列异常和函数:

exception struct.error

会在多种场合下被引发的异常;其参数为一个描述错误信息的字符串。

struct.pack(format, v1, v2, ...)

返回一个 bytes 对象,其中包含根据格式字符串 format 打包的值 v1, v2, ... 参数个数必须与格式字符串所要求的值完全匹配。

struct.pack_into(format, buffer, offset, v1, v2, ...)

根据格式字符串 format 打包 v1, v2, ... 等值并将打包的字节串写入可写缓冲区 bufferoffset 开始的位置。 请注意 offset 是必需的参数。

struct.unpack(format, buffer)

根据格式字符串 format 从缓冲区 buffer 解包(假定是由 pack(format, ...) 打包)。 结果为一个元组,即使其只包含一个条目。 缓冲区的字节大小必须匹配格式所要求的大小,如 calcsize() 所示。

struct.unpack_from(format, /, buffer, offset=0)

buffer 从位置 offset 开始根据格式字符串 format 进行解包。 结果为一个元组,即使其中只包含一个条目。 缓冲区的字节大小从位置 offset 开始必须至少为 calcsize() 显示的格式所要求的大小。

struct.iter_unpack(format, buffer)

Iteratively unpack from the buffer buffer according to the format string format. This function returns an iterator which will read equally sized chunks from the buffer until all its contents have been consumed. The buffer's size in bytes must be a multiple of the size required by the format, as reflected by calcsize().

每次迭代将产生一个如格式字符串所指定的元组。

3.4 新版功能.

struct.calcsize(format)

返回与格式字符串 format 相对应的结构的大小(亦即 pack(format, ...) 所产生的字节串对象的大小)。

格式字符串

Format strings describe the data layout when packing and unpacking data. They are built up from format characters, which specify the type of data being packed/unpacked. In addition, special characters control the byte order, size and alignment. Each format string consists of an optional prefix character which describes the overall properties of the data and one or more format characters which describe the actual data values and padding.

字节顺序,大小和对齐方式

By default, C types are represented in the machine's native format and byte order, and properly aligned by skipping pad bytes if necessary (according to the rules used by the C compiler). This behavior is chosen so that the bytes of a packed struct correspond exactly to the memory layout of the corresponding C struct. Whether to use native byte ordering and padding or standard formats depends on the application.

或者,根据下表,格式字符串的第一个字符可用于指示打包数据的字节顺序,大小和对齐方式:

字符

字节顺序

大小

对齐方式

@

按原字节

按原字节

按原字节

=

按原字节

标准

<

小端

标准

>

大端

标准

!

网络(=大端)

标准

如果第一个字符不是其中之一,则假定为 '@'

Native byte order is big-endian or little-endian, depending on the host system. For example, Intel x86, AMD64 (x86-64), and Apple M1 are little-endian; IBM z and many legacy architectures are big-endian. Use sys.byteorder to check the endianness of your system.

本机大小和对齐方式是使用 C 编译器的 sizeof 表达式来确定的。 这总是会与本机字节顺序相绑定。

标准大小仅取决于格式字符;请参阅 格式字符 部分中的表格。

请注意 '@''=' 之间的区别:两个都使用本机字节顺序,但后者的大小和对齐方式是标准化的。

形式 '!' 代表网络字节顺序总是使用在 IETF RFC 1700 中所定义的大端序。

没有什么方式能指定非本机字节顺序(强制字节对调);请正确选择使用 '<''>'

注释:

  1. 填充只会在连续结构成员之间自动添加。 填充不会添加到已编码结构的开头和末尾。

  2. 当使用非本机大小和对齐方式即 '<', '>', '=', and '!' 时不会添加任何填充。

  3. 要将结构的末尾对齐到符合特定类型的对齐要求,请以该类型代码加重复计数的零作为格式结束。 参见 例子

格式字符

格式字符具有以下含义;C 和 Python 值之间的按其指定类型的转换应当是相当明显的。 ‘标准大小’列是指当使用标准大小时以字节表示的已打包值大小;也就是当格式字符串以 '<', '>', '!''=' 之一开头的情况。 当使用本机大小时,已打包值的大小取决于具体的平台。

格式

C 类型

Python 类型

标准大小

备注

x

填充字节

(7)

c

char

长度为 1 的字节串

1

b

signed char

整数

1

(1), (2)

B

unsigned char

整数

1

(2)

?

_Bool

bool

1

(1)

h

short

整数

2

(2)

H

unsigned short

整数

2

(2)

i

int

整数

4

(2)

I

unsigned int

整数

4

(2)

l

long

整数

4

(2)

L

unsigned long

整数

4

(2)

q

long long

整数

8

(2)

Q

unsigned long long

整数

8

(2)

n

ssize_t

整数

(3)

N

size_t

整数

(3)

e

(6)

float

2

(4)

f

float

float

4

(4)

d

double

float

8

(4)

s

char[]

字节串

(9)

p

char[]

字节串

(8)

P

void*

整数

(5)

在 3.3 版更改: 增加了对 'n''N' 格式的支持

在 3.6 版更改: 添加了对 'e' 格式的支持。

注释:

  1. The '?' conversion code corresponds to the _Bool type defined by C99. If this type is not available, it is simulated using a char. In standard mode, it is always represented by one byte.

  2. When attempting to pack a non-integer using any of the integer conversion codes, if the non-integer has a __index__() method then that method is called to convert the argument to an integer before packing.

    在 3.2 版更改: Added use of the __index__() method for non-integers.

  3. 'n''N' 转换码仅对本机大小可用(选择为默认或使用 '@' 字节顺序字符)。 对于标准大小,你可以使用适合你的应用的任何其他整数格式。

  4. 对于 'f', 'd''e' 转换码,打包表示形式将使用 IEEE 754 binary32, binary64 或 binary16 格式 (分别对应于 'f', 'd''e'),无论平台使用何种浮点格式。

  5. 'P' 格式字符仅对本机字节顺序可用(选择为默认或使用 '@' 字节顺序字符)。 字节顺序字符 '=' 选择使用基于主机系统的小端或大端排序。 struct 模块不会将其解读为本机排序,因此 'P' 格式将不可用。

  6. IEEE 754 binary16 "半精度" 类型是在 IEEE 754 标准 的 2008 修订版中引入的。 它包含一个符号位,5 个指数位和 11 个精度位(明确存储 10 位),可以完全精确地表示大致范围在 6.1e-056.5e+04 之间的数字。 此类型并不被 C 编译器广泛支持:在一台典型的机器上,可以使用 unsigned short 进行存储,但不会被用于数学运算。 请参阅维基百科页面 half-precision floating-point format 了解详情。

  7. When packing, 'x' inserts one NUL byte.

  8. 'p' 格式字符用于编码“Pascal 字符串”,即存储在由计数指定的 固定长度字节 中的可变长度短字符串。 所存储的第一个字节为字符串长度或 255 中的较小值。 之后是字符串对应的字节。 如果传入 pack() 的字符串过长(超过计数值减 1),则只有字符串前 count-1 个字节会被存储。 如果字符串短于 count-1,则会填充空字节以使得恰好使用了 count 个字节。 请注意对于 unpack()'p' 格式字符会消耗 count 个字节,但返回的字符串永远不会包含超过 255 个字节。

  9. For the 's' format character, the count is interpreted as the length of the bytes, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string mapping to or from a single Python byte string, while '10c' means 10 separate one byte character elements (e.g., cccccccccc) mapping to or from ten different Python byte objects. (See 例子 for a concrete demonstration of the difference.) If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting bytes object always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).

格式字符之前可以带有整数重复计数。 例如,格式字符串 '4h' 的含义与 'hhhh' 完全相同。

格式之间的空白字符会被忽略;但是计数及其格式字符中不可有空白字符。

当使用某一种整数格式 ('b', 'B', 'h', 'H', 'i', 'I', 'l', 'L', 'q', 'Q') 打包值 x 时,如果 x 在该格式的有效范围之外则将引发 struct.error

在 3.1 版更改: 在之前版本中,某些整数格式包装了超范围的值并会引发 DeprecationWarning 而不是 struct.error

对于 '?' 格式字符,返回值为 TrueFalse。 在打包时将会使用参数对象的逻辑值。 以本机或标准 bool 类型表示的 0 或 1 将被打包,任何非零值在解包时将为 True

例子

备注

Native byte order examples (designated by the '@' format prefix or lack of any prefix character) may not match what the reader's machine produces as that depends on the platform and compiler.

Pack and unpack integers of three different sizes, using big endian ordering:

>>> from struct import *
>>> pack(">bhl", 1, 2, 3)
b'\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('>bhl', b'\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('>bhl')
7

Attempt to pack an integer which is too large for the defined field:

>>> pack(">h", 99999)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
struct.error: 'h' format requires -32768 <= number <= 32767

Demonstrate the difference between 's' and 'c' format characters:

>>> pack("@ccc", b'1', b'2', b'3')
b'123'
>>> pack("@3s", b'123')
b'123'

解包的字段可通过将它们赋值给变量或将结果包装为一个具名元组来命名:

>>> record = b'raymond   \x32\x12\x08\x01\x08'
>>> name, serialnum, school, gradelevel = unpack('<10sHHb', record)

>>> from collections import namedtuple
>>> Student = namedtuple('Student', 'name serialnum school gradelevel')
>>> Student._make(unpack('<10sHHb', record))
Student(name=b'raymond   ', serialnum=4658, school=264, gradelevel=8)

The ordering of format characters may have an impact on size in native mode since padding is implicit. In standard mode, the user is responsible for inserting any desired padding. Note in the first pack call below that three NUL bytes were added after the packed '#' to align the following integer on a four-byte boundary. In this example, the output was produced on a little endian machine:

>>> pack('@ci', b'#', 0x12131415)
b'#\x00\x00\x00\x15\x14\x13\x12'
>>> pack('@ic', 0x12131415, b'#')
b'\x15\x14\x13\x12#'
>>> calcsize('@ci')
8
>>> calcsize('@ic')
5

The following format 'llh0l' results in two pad bytes being added at the end, assuming the platform's longs are aligned on 4-byte boundaries:

>>> pack('@llh0l', 1, 2, 3)
b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00'

参见

模块 array

被打包为二进制存储的同质数据。

Module json

JSON encoder and decoder.

Module pickle

Python object serialization.

Applications

Two main applications for the struct module exist, data interchange between Python and C code within an application or another application compiled using the same compiler (native formats), and data interchange between applications using agreed upon data layout (standard formats). Generally speaking, the format strings constructed for these two domains are distinct.

Native Formats

When constructing format strings which mimic native layouts, the compiler and machine architecture determine byte ordering and padding. In such cases, the @ format character should be used to specify native byte ordering and data sizes. Internal pad bytes are normally inserted automatically. It is possible that a zero-repeat format code will be needed at the end of a format string to round up to the correct byte boundary for proper alignment of consective chunks of data.

Consider these two simple examples (on a 64-bit, little-endian machine):

>>> calcsize('@lhl')
24
>>> calcsize('@llh')
18

Data is not padded to an 8-byte boundary at the end of the second format string without the use of extra padding. A zero-repeat format code solves that problem:

>>> calcsize('@llh0l')
24

The 'x' format code can be used to specify the repeat, but for native formats it is better to use a zero-repeat format like '0l'.

By default, native byte ordering and alignment is used, but it is better to be explicit and use the '@' prefix character.

Standard Formats

When exchanging data beyond your process such as networking or storage, be precise. Specify the exact byte order, size, and alignment. Do not assume they match the native order of a particular machine. For example, network byte order is big-endian, while many popular CPUs are little-endian. By defining this explicitly, the user need not care about the specifics of the platform their code is running on. The first character should typically be < or > (or !). Padding is the responsibility of the programmer. The zero-repeat format character won't work. Instead, the user must explicitly add 'x' pad bytes where needed. Revisiting the examples from the previous section, we have:

>>> calcsize('<qh6xq')
24
>>> pack('<qh6xq', 1, 2, 3) == pack('@lhl', 1, 2, 3)
True
>>> calcsize('@llh')
18
>>> pack('@llh', 1, 2, 3) == pack('<qqh', 1, 2, 3)
True
>>> calcsize('<qqh6x')
24
>>> calcsize('@llh0l')
24
>>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
True

The above results (executed on a 64-bit machine) aren't guaranteed to match when executed on different machines. For example, the examples below were executed on a 32-bit machine:

>>> calcsize('<qqh6x')
24
>>> calcsize('@llh0l')
12
>>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
False

struct 模块还定义了以下类型:

class struct.Struct(format)

Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling module-level functions with the same format since the format string is only compiled once.

备注

传递给 Struct 和模块层级函数的已编译版最新格式字符串会被缓存,因此只使用少量格式字符串的程序无需担心重用单独的 Struct 实例。

已编译的 Struct 对象支持以下方法和属性:

pack(v1, v2, ...)

等价于 pack() 函数,使用了已编译的格式。 (len(result) 将等于 size。)

pack_into(buffer, offset, v1, v2, ...)

等价于 pack_into() 函数,使用了已编译的格式。

unpack(buffer)

等价于 unpack() 函数,使用了已编译的格式。 缓冲区的字节大小必须等于 size

unpack_from(buffer, offset=0)

等价于 unpack_from() 函数,使用了已编译的格式。 缓冲区的字节大小从位置 offset 开始必须至少为 size

iter_unpack(buffer)

等价于 iter_unpack() 函数,使用了已编译的格式。 缓冲区的大小必须为 size 的整数倍。

3.4 新版功能.

format

用于构造此 Struct 对象的格式字符串。

在 3.7 版更改: 格式字符串类型现在是 str 而不再是 bytes

size

计算出对应于 format 的结构大小(亦即 pack() 方法所产生的字节串对象的大小)。