Blob and Row Overflow Storage Internals – Row-Overflow Data

2012-07-31 - General, SQL Server Internals

Introduction

SQL Server has two mechanisms to store large amounts of data in a single row. One is Large Object (LOB) data storage. The other one is Row-Overflow data storage. In this series of articles I am going to take a deeper look into these two mechanisms.

What is a Page

SQL Server organizes everything in pages. Data on disk is split into pages, data is loaded into memory one page at a time and even data in memory is always accessed on a page level.

There are several different types of pages. Data pages, index pages, IAM pages that map which page belongs to which index, pages that track which pages in the database are in use and which ones are free and LOB and Row-Overflow pages.

Each page is 8192 bytes in size. The first 96 bytes contain the page header. The page header contains information about the page itself like the type of information stored in it. It contains links to the object and the allocation unit the page belongs to. If the page is part of a sorted index it also contains a link to its predecessor and successor pages.

At the end of data pages you will find a row offset array that contains a 2 byte pointer for each row in the page. This row offset array establishes the index order within the page; the data itself might not be stored in order within the page. The remaining 8192 – 96 – 2*(number of rows) bytes can be used by actual row data.

If you fill up the space with many small rows, you can actually completely fill it. However the maximum length of a single row is not 8094 as above formula suggests but instead only 8060. The remaining 34 bytes are reserved for additional row (meta-)data like a hidden uniqueifier column for CLUSTERED INDEXES that were not explicitly declared UNIQUE, or row version information used by snapshot isolation.

Each row contains at least seven bytes of metadata information. Based on the number of columns and their data types there could be potentially a lot more metadata in a row. The first seven metadata bytes are at the beginning of the row, the remaining ones are interspersed throughout. That leaves at most 8053 bytes for actual data.

All fixed length columns like INT or CHAR(n) must fit into that size. Up to SQL Server 2000 all other columns also had to fit in there too. That requirement got removed in later versions by introducing the Row-Overflow Storage mechanism.

Row-Overflow Data

Starting with SQL Server 2005, tables can be created that allow for more than 8053 bytes to be stored within a row. For example it is now possible to create a table with several VARCHAR(8000) columns.

Even before SQL Server 2005 large amounts of data could be stored in the now deprecated IMAGE and TEXT data types. SQL Server 2005 introduced the [N]VARCHAR(MAX) and VARBINARY(MAX) data types to replace TEXT and IMAGE. While the storage mechanism behind these data types is very similar to the Row-Overflow storage mechanisms, this article is going to concentrate on Row-Overflow data only.

Row-Overflow Allocation Units

The header and the data within a single row have to fit within a single "data" page. To allow data to spill past the boundaries of a single page, a new page type was introduced: "Row-Overflow Data" pages.

All data and index pages that belong to a single partition of a single index or heap are grouped together into a single unit called "allocation unit". Each index or heap partition has at least one allocation unit, the data page allocation unit. Up to SQL Server 2012 one is also the maximum number of data page allocation units per partition.

Row-Overflow pages are not stored inside that same allocation unit. Instead they are grouped inside a separate allocation unit, the Row-Overflow data allocation unit. Again, SQL Server 2012 and earlier allow at max one Row-Overflow allocation unit per index or heap partition

As soon as a table is created, an empty allocation unit for data pages is created. If that table has the potential to require Row-Overflow data storage, a Row-Overflow data allocation unit is created as well. If the table is partitioned, the same is done for each declared partition of the table, so a table with four partitions and the potential for Row-Overflow data will have 8 allocation units created and assigned to it right when the table itself is created.

To confirm the above (ignoring partitions for now) you can use this script to create two tables, one with a single VARCHAR(8000) column and one with two:

[sql] CREATE TABLE dbo.Standard(
Id BIGINT CONSTRAINT [dbo.Standard(Id):PK] PRIMARY KEY CLUSTERED,
VC1 VARCHAR(8000)
);

CREATE TABLE dbo.Overflow(
Id BIGINT CONSTRAINT [dbo.Overflow(Id):PK] PRIMARY KEY CLUSTERED,
VC1 VARCHAR(8000),
VC2 VARCHAR(8000)
);
[/sql]

After those tables are created run the following query:

[sql] SELECT OBJECT_SCHEMA_NAME(p.object_id) table_schema,
OBJECT_NAME(p.object_id) table_name,
p.index_id,
p.partition_number,
au.allocation_unit_id,
au.type_desc,
au.total_pages,
au.used_pages,
au.data_pages
FROM sys.system_internals_allocation_units au
JOIN sys.partitions p
ON au.container_id = p.partition_id
ORDER BY table_schema,
table_name,
p.index_id,
p.partition_number,
au.type;
[/sql]

For the two tables it will show a total of three allocation units, two of type IN_ROW_DATA and one of type ROW_OVERFLOW_DATA. If you did not insert any data, all three will show 0 in the total_pages column.

Too Much Data

Now let us insert two rows into the dbo.Overflow table:

[sql] INSERT INTO dbo.Overflow(ID,VC1,VC2)
VALUES
(1,REPLICATE('A',4000),REPLICATE('B',4000)),
(2,REPLICATE('A',4000),REPLICATE('B',8000));
[/sql]

The first row has 8008 bytes of data plus an additional 13 bytes of metadata totaling 8021 bytes. That is less than the 8060 bytes maximum, so it will fit into a page. To confirm that it actually gets stored in a single page you can run the following scripts:

[sql] DBCC IND(0,'dbo.Overflow',-1)WITH NO_INFOMSGS;
[/sql]

This script is using the not officially supported DBCC IND command. DBCC IND takes three parameters. The first is the database name or database id. If a 0 is passed in here, the current database is used. The second parameter is the table name. The last parameter is the index id. It is also possible to pass in a -1 here to have DBCC IND return information about all indexes of that table.

The above script will return a result set like this one:

PageFID PagePID IAMFID IAMPID ObjectID IndexID PartitionNumber PartitionID iam_chain_type PageType IndexLevel
1 6315 NULL NULL 677577452 1 1 72057594040025088 In-row data 10 NULL
1 6314 1 6315 677577452 1 1 72057594040025088 In-row data 1 0
1 4795 1 6315 677577452 1 1 72057594040025088 In-row data 2 1
1 4796 1 6315 677577452 1 1 72057594040025088 In-row data 1 0
1 6317 NULL NULL 677577452 1 1 72057594040025088 Row-overflow data 10 NULL
1 6316 1 6317 677577452 1 1 72057594040025088 Row-overflow data 3 0

This result set contains one row per page that belongs to the dbo.Overflow table. You can see several different PageType values. For this article we are going to look only at the PageType = 1 pages of which there are two and at the single PageType = 3 page.

Pages with PageType = 1 store the actual row data of the table, pages with PageType = 3 are used to store the Row-Overflow data. The first two columns in above result set give the page address consisting of the file number and the page number within that file. In above example output the first data page has the address 1:6314, the second data page's is 1:4796 and the Row-Overflow page can be found at 1:6316. Those numbers will be different for you, and you will need to adjust the following examples if you want to try this yourself.

To actually look at the data in the pages, we need to use another not officially supported command:

[sql] DBCC TRACEON(3604);
DBCC PAGE(0,1,6314,3)WITH NO_INFOMSGS;
[/sql]

DBCC PAGE takes four parameters. The first one is the database name or database id. Similar to DBCC IND, a "0" means current database. The second and third parameters are the page address. The last one selects the amount of detail returned; we will pick 3 for it. Turning trace flag 3604 on is required to see any output from DBCC PAGE.

The output takes a little getting used to. At the top all the page header information is detailed. The information we are looking for is right beneath the header and will look similar to this (shortened for readability):

[plain] Slot 0 Offset 0x60 Length 8021

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Record Size = 8021
Memory Dump @0x000000001384A060

0000000000000000: 30000c00 01000000 00000000 03000002 †0...............
0000000000000010: 00b50f55 1f414141 41414141 41414141 †.µ.U.AAAAAAAAAAA
<…>
0000000000001F40: 42424242 42424242 42424242 42424242 †BBBBBBBBBBBBBBBB
0000000000001F50: 42424242 42††††††††††††††††††††††††††BBBBB
Slot 0 Column 1 Offset 0x4 Length 8 Length (physical) 8

Id = 1

Slot 0 Column 2 Offset 0x15 Length 4000 Length (physical) 4000

VC1 = AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
<…>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Slot 0 Column 3 Offset 0xfb5 Length 4000 Length (physical) 4000

VC2 = BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
<…>
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (1b7fe5b8af93)
[/plain]

It starts out with telling us that the row in Slot 0 starts at offset 0x60 and has a length of 8021 bytes. That matches what we expected for the first row in the table. After the complete row memory dump, all column values are also listed.

As expected, the entire first row is stored inside a single data page. To look at the second row we need to move on to the second page, as the first page is now full. Running DBCC PAGE for the second data page (1:4796) will return output similar to this:

[plain] Slot 0 Offset 0x60 Length 4045

Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS
Record Size = 4045
Memory Dump @0x000000000D72A060

0000000000000000: 30000c00 02000000 00000000 03000002 †0...............
0000000000000010: 00b50fcd 8f414141 41414141 41414141 †.µ.Í.AAAAAAAAAAA
<..>
0000000000000FB0: 41414141 41020000 00010000 00402700 †AAAAA........@'.
0000000000000FC0: 00401f00 00ac1800 00010000 00††††††††.@...¬.......

Slot 0 Column 1 Offset 0x4 Length 8 Length (physical) 8

Id = 2

Slot 0 Column 2 Offset 0x15 Length 4000 Length (physical) 4000

VC1 = AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
<..>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

VC2 = [BLOB Inline Root] Slot 0 Column 3 Offset 0xfb5 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 658505728
Link 0

Size = 8000 RowId = (1:6316:0)

Slot 0 Offset 0x0 Length 0 Length (physical) 0

KeyHashValue = (f03d7d8b0dcc)
[/plain]

This time the row length is shown as 4045. The value for column VC2 was replaced with a "BLOB Inline Root". That is a structure that points to a Row-Overflow page. The pointer is 24 bytes long. Those 24 bytes contain the actual page address (1:6316) as well as the length of the record stored there. There is also some additional metadata stored in this entry.

Because each BLOB root is 24 bytes in size we can at max fit a little more than 300 of them in one row. Keep that in mind when adding many large variable length columns.

Row-Overflow Records

If you looked closely at the above output, the page address is not (1:6316) but instead (1:6316:0). The additional 0 is the slot number. Like a data page, a Row-Overflow page can have more than one record. Those records can be from the same or from different rows in the underlying table. However, because the page belongs to the allocation unit that is tied to a specific index partition, you will never find records from different partitions, indexes or even tables in the same page.

To see an example for this type of sharing let us run the following script that creates a table with several VARCHAR(1000) columns and inserts 2 rows, each with all columns filled:

[sql] CREATE TABLE dbo.ManyCols(
Id BIGINT CONSTRAINT [dbo.ManyCols(Id):PK] PRIMARY KEY CLUSTERED,
VC1 VARCHAR(1000),
VC2 VARCHAR(1000),
VC3 VARCHAR(1000),
VC4 VARCHAR(1000),
VC5 VARCHAR(1000),
VC6 VARCHAR(1000),
VC7 VARCHAR(1000),
VC8 VARCHAR(1000),
VC9 VARCHAR(1000),
VCA VARCHAR(1000),
VCB VARCHAR(1000),
);

INSERT INTO dbo.ManyCols(id,VC1,VC2,VC3,VC4,VC5,VC6,VC7,VC8,VC9,VCA,VCB)
VALUES
(
1,
REPLICATE('[VC1 Id=1]',100),
REPLICATE('[VC2 Id=1]',100),
REPLICATE('[VC3 Id=1]',100),
REPLICATE('[VC4 Id=1]',100),
REPLICATE('[VC5 Id=1]',100),
REPLICATE('[VC6 Id=1]',100),
REPLICATE('[VC7 Id=1]',100),
REPLICATE('[VC8 Id=1]',100),
REPLICATE('[VC9 Id=1]',100),
REPLICATE('[VCA Id=1]',100),
REPLICATE('[VCB Id=1]',100)
),
(
2,
REPLICATE('[VC1 Id=2]',100),
REPLICATE('[VC2 Id=2]',100),
REPLICATE('[VC3 Id=2]',100),
REPLICATE('[VC4 Id=2]',100),
REPLICATE('[VC5 Id=2]',100),
REPLICATE('[VC6 Id=2]',100),
REPLICATE('[VC7 Id=2]',100),
REPLICATE('[VC8 Id=2]',100),
REPLICATE('[VC9 Id=2]',100),
REPLICATE('[VCA Id=2]',100),
REPLICATE('[VCB Id=2]',100)
);
[/sql]

This table has again two data pages, 1:4800 and 1:4809 on my machine. The Row-Overflow pointers for row 1 look like this:

[plain] Slot 0 Column 1 Offset 0x4 Length 8 Length (physical) 8

Id = 1

VC1 = [BLOB Inline Root] Slot 0 Column 2 Offset 0x28 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 1698955264
Link 0

Size = 1000 RowId = (1:4798:0)

VC2 = [BLOB Inline Root] Slot 0 Column 3 Offset 0x40 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 698089472
Link 0

Size = 1000 RowId = (1:4798:1)

VC3 = [BLOB Inline Root] Slot 0 Column 4 Offset 0x58 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 854392832
Link 0

Size = 1000 RowId = (1:4798:2)

VC4 = [BLOB Inline Root] Slot 0 Column 5 Offset 0x70 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 1247477760
Link 0

Size = 1000 RowId = (1:4798:3)

Slot 0 Column 6 Offset 0x88 Length 1000 Length (physical) 1000
[/plain]

Row 2 contains these pointers:

[plain] Slot 0 Column 1 Offset 0x4 Length 8 Length (physical) 8

Id = 2

VC1 = [BLOB Inline Root] Slot 0 Column 2 Offset 0x28 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 111345664
Link 0

Size = 1000 RowId = (1:4798:4)

VC2 = [BLOB Inline Root] Slot 0 Column 3 Offset 0x40 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 872677376
Link 0

Size = 1000 RowId = (1:4798:5)

VC3 = [BLOB Inline Root] Slot 0 Column 4 Offset 0x58 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 123863040
Link 0

Size = 1000 RowId = (1:4798:6)

VC4 = [BLOB Inline Root] Slot 0 Column 5 Offset 0x70 Length 24 Length (physical) 24

Level = 0 Unused = 0 UpdateSeq = 1
TimeStamp = 876019712
Link 0

Size = 1000 RowId = (1:4802:0)

Slot 0 Column 6 Offset 0x88 Length 1000 Length (physical) 1000
[/plain]

As you can see, for each row the first four columns were moved off the data page. All four records for row one got stored in page (1:4798) while row two fit only three more into the same page. The fourth one got moved to a new Row-Overflow page (1:4802).

If you execute DBCC PAGE on the first Row-Overflow page (1:4798), you will get output that looks like this (again shortened for readability):

[plain] Blob row at: Page (1:4798) Slot 0 Length: 1014 Type: 3 (DATA)
Blob Id:1698955264

000000000D72A06E: 3143565b 3d644920 565b5d31 49203143 [VC1 Id=1][VC1 I
<...>
000000000D72A44E: 49203143 5d313d64 C1 Id=1]

Blob row at: Page (1:4798) Slot 1 Length: 1014 Type: 3 (DATA)
Blob Id:698089472

000000000D72A472: 3243565b 3d644920 565b5d31 49203243 [VC2 Id=1][VC2 I
<...>
000000000D72A852: 49203243 5d313d64 C2 Id=1]

Blob row at: Page (1:4798) Slot 2 Length: 1014 Type: 3 (DATA)
Blob Id:854392832

000000000D72A876: 3343565b 3d644920 565b5d31 49203343 [VC3 Id=1][VC3 I
<...>
000000000D72AC56: 49203343 5d313d64 C3 Id=1]

Blob row at: Page (1:4798) Slot 3 Length: 1014 Type: 3 (DATA)
Blob Id:1247477760

000000000D72AC7A: 3443565b 3d644920 565b5d31 49203443 [VC4 Id=1][VC4 I
<...>
000000000D72B05A: 49203443 5d313d64 C4 Id=1]

Blob row at: Page (1:4798) Slot 4 Length: 1014 Type: 3 (DATA)
Blob Id:111345664

000000000D72B07E: 3143565b 3d644920 565b5d32 49203143 [VC1 Id=2][VC1 I
<...>
000000000D72B45E: 49203143 5d323d64 C1 Id=2]

Blob row at: Page (1:4798) Slot 5 Length: 1014 Type: 3 (DATA)
Blob Id:872677376

000000000D72B482: 3243565b 3d644920 565b5d32 49203243 [VC2 Id=2][VC2 I
<...>
000000000D72B862: 49203243 5d323d64 C2 Id=2]

Blob row at: Page (1:4798) Slot 6 Length: 1014 Type: 3 (DATA)
Blob Id:123863040

000000000D72B886: 3343565b 3d644920 565b5d32 49203343 [VC3 Id=2][VC3 I
<...>
000000000D72BC66: 49203343 5d323d64 C3 Id=2] [/plain]

You can use the actual data values in those columns to match them to the row and column in the underlying table.

Each record has a length specified as 1014. As the data itself is only 1000 bytes, this tells us that there is 14 bytes of row metadata. That metadata contains the length of the record as well as the "Blob Id". That blob id matches for each record the value that was shown as "Timestamp" on the Row-Overflow pointer in the data page.

You might also notice that the difference in bytes between the start of two consecutive records in the output above is equal to 0x404 or 1028. That means that there are another 14 bytes following each record. I do not know what they are used for. Leave a comment below if you have any insight into their purpose.

At the end of each Row-Overflow page you will find a slot array, similar to the ones used in data pages. It points to the beginning of each record in the table. To make it visible you need to look at the output of DBCC PAGE with the last parameter set to two. That will display the entire page as a hex dump without adding any interpretation to it.

Conclusion

The Row-Overflow storage mechanism introduced with SQL Server 2005 allows more flexibility when designing your table layout as variable length columns do not have to fit into a single page anymore. Fixed length columns still need to completely reside within the actual data page however.

Also be aware that, while the underlying mechanics are hidden from the user, overuse can have a significant impact on performance, as reading a single row might now require several pages to be accessed. For this reason SQL Server will always prefer to store the entire row in one page if possible, even if that means that more pages in total will be required.

Categories: General, SQL Server Internals

6 Responses to Blob and Row Overflow Storage Internals – Row-Overflow Data

Leave a Reply