Commit e69bfe9

tgl authored and sfdcxiong-gang committed

Fix multiple problems in WAL replay.
Most of the replay functions for WAL record types that modify more than one page failed to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods.

The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page.

To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods.
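In code terms, the per-page replay idiom that results (and that recurs throughout the GIN diff below) looks roughly like this. This is a non-compilable sketch against the PostgreSQL 9.x WAL-replay internals touched by this commit, not standalone code; the key point is the final argument of RestoreBackupBlock, which asks it to return with the buffer still pinned and locked so that ordering and lock-holding constraints survive regardless of whether the page came from a full-page image:

```c
/* Sketch only: PostgreSQL internals (XLogRecord, Buffer, etc.) assumed. */
if (IsBkpBlockApplied(record, 0))
    buffer = RestoreBackupBlock(lsn, record, 0, false, true);   /* keep lock */
else
{
    buffer = XLogReadBuffer(node, blkno, false);
    if (BufferIsValid(buffer))
    {
        Page    page = BufferGetPage(buffer);

        if (!XLByteLE(lsn, PageGetLSN(page)))   /* change not yet applied */
        {
            /* ... apply the non-full-page change to the page ... */
            PageSetLSN(page, lsn);
            MarkBufferDirty(buffer);
        }
    }
}

/* ... update the record's other pages while this lock is still held ... */

if (BufferIsValid(buffer))
    UnlockReleaseBuffer(buffer);
```

Either way, the page is updated at the same point in the physical sequence of the record's page updates, which is what the old RestoreBkpBlocks design could not guarantee.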
A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specific functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code.

In addition, fix some other bugs identified in the course of making these changes:

spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries.

Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption.

heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that.

Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten.

Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
1 parent 3e0c55d commit e69bfe9

File tree

9 files changed: +734 -427 lines changed

src/backend/access/gin/ginfast.c

Lines changed: 9 additions & 1 deletion
@@ -289,7 +289,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	if (metadata->head == InvalidBlockNumber)
 	{
 		/*
-		 * Main list is empty, so just copy sublist into main list
+		 * Main list is empty, so just insert sublist as main list
 		 */
 		START_CRIT_SECTION();
 
@@ -312,6 +312,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 		LockBuffer(buffer, GIN_EXCLUSIVE);
 		page = BufferGetPage(buffer);
 
+		rdata[0].next = rdata + 1;
+
+		rdata[1].buffer = buffer;
+		rdata[1].buffer_std = true;
+		rdata[1].data = NULL;
+		rdata[1].len = 0;
+		rdata[1].next = NULL;
+
 		Assert(GinPageGetOpaque(page)->rightlink == InvalidBlockNumber);
 
 		START_CRIT_SECTION();
src/backend/access/gin/ginxlog.c

Lines changed: 81 additions & 33 deletions
@@ -79,6 +79,9 @@ ginRedoCreateIndex(XLogRecPtr lsn, XLogRecord *record)
 				MetaBuffer;
 	Page		page;
 
+	/* Backup blocks are not used in create_index records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
 	MetaBuffer = XLogReadBuffer(*node, GIN_METAPAGE_BLKNO, true);
 	Assert(BufferIsValid(MetaBuffer));
 	page = (Page) BufferGetPage(MetaBuffer);
@@ -109,6 +112,9 @@ ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 
+	/* Backup blocks are not used in create_ptree records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
 	buffer = XLogReadBuffer(data->node, data->blkno, true);
 	Assert(BufferIsValid(buffer));
 	page = (Page) BufferGetPage(buffer);
@@ -158,9 +164,12 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
 		}
 	}
 
-	/* nothing else to do if page was backed up */
+	/* If we have a full-page image, restore it and we're done */
 	if (IsBkpBlockApplied(record, 0))
+	{
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
 		return;
+	}
 
 	buffer = XLogReadBuffer(data->node, data->blkno, false);
 	if (!BufferIsValid(buffer))
@@ -254,6 +263,9 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
 	if (data->isData)
 		flags |= GIN_DATA;
 
+	/* Backup blocks are not used in split records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
 	lbuffer = XLogReadBuffer(data->node, data->lblkno, true);
 	Assert(BufferIsValid(lbuffer));
 	lpage = (Page) BufferGetPage(lbuffer);
@@ -364,9 +376,11 @@ ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 
-	/* nothing to do if page was backed up (and no info to do it with) */
 	if (IsBkpBlockApplied(record, 0))
+	{
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
 		return;
+	}
 
 	buffer = XLogReadBuffer(data->node, data->blkno, false);
 	if (!BufferIsValid(buffer))
@@ -414,60 +428,71 @@ static void
 ginRedoDeletePage(XLogRecPtr lsn, XLogRecord *record)
 {
 	ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
-	Buffer		buffer;
+	Buffer		dbuffer;
+	Buffer		pbuffer;
+	Buffer		lbuffer;
 	Page		page;
 
-	if (!(IsBkpBlockApplied(record, 0)))
+	if (IsBkpBlockApplied(record, 0))
+		dbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+	else
 	{
-		buffer = XLogReadBuffer(data->node, data->blkno, false);
-		if (BufferIsValid(buffer))
+		dbuffer = XLogReadBuffer(data->node, data->blkno, false);
+		if (BufferIsValid(dbuffer))
 		{
-			page = BufferGetPage(buffer);
+			page = BufferGetPage(dbuffer);
 			if (!XLByteLE(lsn, PageGetLSN(page)))
 			{
 				Assert(GinPageIsData(page));
 				GinPageGetOpaque(page)->flags = GIN_DELETED;
 				PageSetLSN(page, lsn);
-				MarkBufferDirty(buffer);
+				MarkBufferDirty(dbuffer);
 			}
-			UnlockReleaseBuffer(buffer);
 		}
 	}
 
-	if (!(IsBkpBlockApplied(record, 1)))
+	if (IsBkpBlockApplied(record, 1))
+		pbuffer = RestoreBackupBlock(lsn, record, 1, false, true);
+	else
 	{
-		buffer = XLogReadBuffer(data->node, data->parentBlkno, false);
-		if (BufferIsValid(buffer))
+		pbuffer = XLogReadBuffer(data->node, data->parentBlkno, false);
+		if (BufferIsValid(pbuffer))
 		{
-			page = BufferGetPage(buffer);
+			page = BufferGetPage(pbuffer);
 			if (!XLByteLE(lsn, PageGetLSN(page)))
 			{
 				Assert(GinPageIsData(page));
 				Assert(!GinPageIsLeaf(page));
 				GinPageDeletePostingItem(page, data->parentOffset);
 				PageSetLSN(page, lsn);
-				MarkBufferDirty(buffer);
+				MarkBufferDirty(pbuffer);
 			}
-			UnlockReleaseBuffer(buffer);
 		}
 	}
 
-	if (!(IsBkpBlockApplied(record, 2)) && data->leftBlkno != InvalidBlockNumber)
+	if (IsBkpBlockApplied(record, 2))
+		(void) RestoreBackupBlock(lsn, record, 2, false, false);
+	else if (data->leftBlkno != InvalidBlockNumber)
 	{
-		buffer = XLogReadBuffer(data->node, data->leftBlkno, false);
-		if (BufferIsValid(buffer))
+		lbuffer = XLogReadBuffer(data->node, data->leftBlkno, false);
+		if (BufferIsValid(lbuffer))
 		{
-			page = BufferGetPage(buffer);
+			page = BufferGetPage(lbuffer);
 			if (!XLByteLE(lsn, PageGetLSN(page)))
 			{
 				Assert(GinPageIsData(page));
 				GinPageGetOpaque(page)->rightlink = data->rightLink;
 				PageSetLSN(page, lsn);
-				MarkBufferDirty(buffer);
+				MarkBufferDirty(lbuffer);
 			}
-			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(lbuffer);
 		}
 	}
+
+	if (BufferIsValid(pbuffer))
+		UnlockReleaseBuffer(pbuffer);
+	if (BufferIsValid(dbuffer))
+		UnlockReleaseBuffer(dbuffer);
 }
 
 static void
@@ -495,7 +520,9 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
 	/*
 	 * insert into tail page
 	 */
-	if (!(record->xl_info & XLR_BKP_BLOCK_1))
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+	else
 	{
 		buffer = XLogReadBuffer(data->node, data->metadata.tail, false);
 		if (BufferIsValid(buffer))
@@ -542,19 +569,24 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
 	/*
 	 * New tail
 	 */
-	buffer = XLogReadBuffer(data->node, data->prevTail, false);
-	if (BufferIsValid(buffer))
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+	else
 	{
-		Page		page = BufferGetPage(buffer);
-
-		if (!XLByteLE(lsn, PageGetLSN(page)))
+		buffer = XLogReadBuffer(data->node, data->prevTail, false);
+		if (BufferIsValid(buffer))
 		{
-			GinPageGetOpaque(page)->rightlink = data->newRightlink;
+			Page		page = BufferGetPage(buffer);
 
-			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			if (!XLByteLE(lsn, PageGetLSN(page)))
+			{
+				GinPageGetOpaque(page)->rightlink = data->newRightlink;
+
+				PageSetLSN(page, lsn);
+				MarkBufferDirty(buffer);
+			}
+			UnlockReleaseBuffer(buffer);
 		}
-		UnlockReleaseBuffer(buffer);
 	}
 }
 
@@ -573,8 +605,12 @@ ginRedoInsertListPage(XLogRecPtr lsn, XLogRecord *record)
 				tupsize;
 	IndexTuple	tuples = (IndexTuple) (XLogRecGetData(record) + sizeof(ginxlogInsertListPage));
 
-	if (record->xl_info & XLR_BKP_BLOCK_1)
+	/* If we have a full-page image, restore it and we're done */
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+	{
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
 		return;
+	}
 
 	buffer = XLogReadBuffer(data->node, data->blkno, true);
 	Assert(BufferIsValid(buffer));
@@ -619,6 +655,9 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
 	Page		metapage;
 	int			i;
 
+	/* Backup blocks are not used in delete_listpage records */
+	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
 	metabuffer = XLogReadBuffer(data->node, GIN_METAPAGE_BLKNO, false);
 	if (!BufferIsValid(metabuffer))
 		return;					/* assume index was deleted, nothing to do */
@@ -631,6 +670,16 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
 		MarkBufferDirty(metabuffer);
 	}
 
+	/*
+	 * In normal operation, shiftList() takes exclusive lock on all the
+	 * pages-to-be-deleted simultaneously.  During replay, however, it should
+	 * be all right to lock them one at a time.  This is dependent on the fact
+	 * that we are deleting pages from the head of the list, and that readers
+	 * share-lock the next page before releasing the one they are on.  So we
+	 * cannot get past a reader that is on, or due to visit, any page we are
+	 * going to delete.  New incoming readers will block behind our metapage
+	 * lock and then see a fully updated page list.
+	 */
 	for (i = 0; i < data->ndeleted; i++)
 	{
 		Buffer		buffer = XLogReadBuffer(data->node, data->toDelete[i], false);
@@ -663,7 +712,6 @@ gin_redo(XLogRecPtr beginLoc, XLogRecPtr lsn, XLogRecord *record)
 	 * implement a similar optimization as we have in b-tree, and remove
 	 * killed tuples outside VACUUM, we'll need to handle that here.
 	 */
-	RestoreBkpBlocks(lsn, record, false);
 
 	topCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)