320 likes | 322 Vues
Learn about NOVA, a high-performance and fault-tolerant file system designed for non-volatile main memories. This system ensures data integrity, provides direct access, and offers fast atomic operations.
E N D
NOVA: A High-Performance,Fault-Tolerant File System for Non-Volatile Main Memories Andiry Xu, Lu Zhang, AmirsamanMemaripour, AkshathaGangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science & Engineering University of California, San Diego
Non-volatile main memory and DAX • Non-volatile main memory (NVMM) • Byte-addressable, durable memory • Latency and bandwidth close to DRAM • Battery-backed DRAM, PCM, 3D XPoint • DAX: direct access (from user-space) • Load/store access • Bypass page-cache • No kernel code in the way
Expectations on NVMM file systems DAX Fault Tolerance Direct Access POSIX I/O Atomicity Speed
DAX Ext4 XFS BtrFS F2FS ✔ ✔ Fault Tolerance Direct Access ❌ ❌ ❌ POSIX I/O Atomicity Speed
DAX PMFSExt4-DAX XFS-DAX ✔ ✔ 3D XPoint NVDIMM DRAM + Flash NVDIMM ✔ ❌ ❌ Fault Tolerance Direct Access POSIX IO Atomicity Speed
DAX NOVA FAST’16 3D XPoint NVDIMM DRAM + Flash NVDIMM ✔ ✔ ✔ ✔ Fault Tolerance Direct Access ❌ POSIX IO Atomicity Speed
DAX NOVA-Fortis SOSP’2017 3D XPoint NVDIMM DRAM + Flash NVDIMM ✔ ✔ ✔ ✔ ✔ Fault Tolerance Direct Access POSIX IO Atomicity Speed
Outline • NOVA architecture • Atomicity • Fault tolerance • Evaluations • Conclusions
NOVA: Log Structured file system For NVMM • Scalable per-inode log design • Inode keeps head/tail pointers • Log contains operation entries • Data pages stored separately CPU 0 CPU 1 Inode table Inode Inode log Head Tail Data 0 Data 1 Data 2
Atomicity: Single log appending • Log-structuring for atomic single log appending • Write, chmod, msync etc. • Atomically update the 8-byte tail • Lower overhead than journaling and shadow paging append(tail, entry); persist(); *tail = new_offset; persist(); Tail Tail Inode log
Atomicity: Lightweight journaling • Lightweight journaling for update across logs • Create, rename, etc • Only journal inode tails instead of metadata or data Directory log Tail Tail Tail Tail File log Dir tail Journal File tail
Atomicity: Copy-on-write • Copy-on-write for atomic file data writes • Data pages stored separately • Recycle pages when overwritten File log Tail Tail Data 0 Data 1 Data 1 Data 2
Fault-tolerance • Snapshot: Point-in-time backups • Data integrity: Continuously detect and repair errors
Snapshot for file access • Global snapshot ID • Saved to each log entry • Checked before recycling file pages Snapshot ID 0 1 Snapshot 0 0 Page 1 1 Page 1 1 Page 1 Inode log Data Data Data Data File write entry Epoch ID Data in snapshot Current data
Snapshot for DAX: Idea • Set pages read-only, then copy-on-write Applications: no file system intervention File system: DAX-mmap() File data: RO
Snapshot for DAX: Incorrect implementation • Application invariant: if V is True, then D is valid Snapshot values Application values • NOVA • snapshot Application thread D = ?; V = False; D = 5; V = True; snapshot_begin(); set_read_only(page_d); copy_on_write(page_d); set_read_only(page_v); snapshot_end(); ? 5 5 ? ? F F T T T ? page fault
Snapshot for DAX: Correct implementation • Application invariant: if V is True, then D is valid Snapshot values Application values • NOVA • snapshot Application thread D = ?; V = False; D = 5; V = True; snapshot_begin(); set_read_only(page_d); set_read_only(page_v); snapshot_end(); copy_on_write(page_d); copy_on_write(page_v); 5 5 ? ? ? F T F F F ? page fault
Application performance during snapshot • Two kinds of applications • File access • DAX-mmap • Normal execution v.s. taking a snapshot every 10 seconds File access DAX-mmap <1% throughput loss average 3.7% throughput loss
NVMM failure modes • Detectable by hardware • Detectable & correctable • Detectable & uncorrectable • Undetectable by hardware • Undetectable media errors • “Scribbles” due to kernel bugs
NVMM failure modes: Media failures • Detectable by hardware • Detectable & correctable • Detectable & uncorrectable • Undetectable by hardware • Undetectable media errors • “Scribbles” due to kernel bugs Software: NVMM Ctrl.: Consumes good data Read Detects & corrects errors NVMM data: Media error
NVMM failure modes: Media failures • Detectable by hardware • Detectable & correctable • Detectable & uncorrectable • Undetectable by hardware • Undetectable media errors • “Scribbles” due to kernel bugs Software: NVMM Ctrl.: Receives MCE Read Detects uncorrectable errors Raises exception NVMM data: Media error
NVMM failure modes: Media failures • Detectable by hardware • Detectable & correctable • Detectable & uncorrectable • Undetectable by hardware • Undetectable media errors • “Scribbles” due to kernel bugs Software: NVMM Ctrl.: Consumes corrupted data Read Sees no error NVMM data: Media error
NVMM failure modes: Scribbles • Detectable by hardware • Detectable & correctable • Detectable & uncorrectable • Undetectable by hardware • Undetectable media errors • “Scribbles” due to kernel bugs Software: NVMM Ctrl.: Bug code scribbles NVMM Write Updates ECC NVMM data: Scribble error
Capture machine checks • Copy (meta)data from NVMM to DRAM • memcpy with exception handling • Check memcpy return code
NOVA metadata protection • Checksum + replication • Per-inode, per-log-entry checksum • Replicate all metadata structures Head Tail replica inode inode C1 Entn Cn Ent1 … Head’ Head1 Head1 Tail1 Tail’ Tail1 csum’ csum csum Head2 Head2 Tail2 Tail2 Ent1 C1 … Entn Cn Data 1 Data 2 Head Tail csum
NOVA tick-tock metadata updates • Always have an fail-safe version of metadata Old Updating New replicainode Head1 Tail1 csum Head2 Tail2 inode Head1 Tail1 csum Head2 Tail2
NOVA data integrity • Checksum + parity • Sub-page per-strip checksums • One parity strip for each page Data 1 Data 2 1 Page (8 strips) P = S0..7 S0 S1 S2 S3 S4 S5 S6 S7 P Ci = CRC32(Si) Replicated Ci
Conclusions • NOVA’s multi-log design achieves high performance, strong consistency, and fault tolerance on NVMM • NOVA supports consistent snapshot with DAX-mmap() • NOVA outperforms existing file systems while providing stronger atomicity and data integrity https://github.com/NVSL/linux-nova
Thank you! https://github.com/NVSL/linux-nova