Falcon: the new Transactional Storage Engine

Jim Starkey

Falcon is based on the netfrastructure db engine
Netfrastructure has been deployed in mission critical apps for >4 years.
Extended and integrated into mysql environment.

What Falcon is NOT:
InnoDB clone
Firebird (open source derivative of Interbase db that Jim wrote years ago)
Firebird clone
Standalone Database Management System (it was, inside of netfrastructure engine)
Netfrastructure (netfra is much more with jvm and search, though these features may roll out later)

What Jim’s learned in 20 years
Disks are slower than CPU and memory than they were 25 years ago.
MVCC=Multi-generational concurrency control (how Jim named it, but someone changed it to “version”)
Putting record versions on disks are problematic
Web applications are better and for the future (religion) [I agree, though, for portability]
People have more important things to do than tune databases

Claim: Falcon is the engine design for the net 20 years.
Goals:
Exploit large memory for more than just a bigger cache
Use threads and processors for data migration
Eliminate tradeoffs, minimize tuning
Scale gracefully to use very heavy loads

Basic Architectural Model:
Incomplete in-memory db with backfill from disk
2 caches: 1) traditional LRU page cache for disk and 2) larger row cache with age group scavenging
Serial log for single write group commits — single write-ahead log.
Multi-version in memory, single version on disk
all transaction states are in memory with automatic overflow to disk
Data and indexes are 1 file plus log files (MySQL does not do this, but most other db servers do)
future: blob repositories (put them in different area of db); multiple page spaces

Basic model is MVCC. It will be extended for relaxed consistency (but why would you want to do that?!?!), and will be extended for serializable.

Index Implementation:
Btree index with prefix compression — no difference in performance with primary or secondary indexes.
No data except key in index
2-stage index retrievals — index scan generates row bitmap, so you can get from 2 indexes before going to rows; records are fetched from disk in physical row order

Data Flow:
Uncommitted row data is staged in memory
On commit, txn copies row data to serial log and written to disk (not committed until OS says the page was written)
post commit, dedicated thread copies row data from serial log to data pages
Page cache periodically flushed to disk (except blob data)
BLOB data is queued for write at blob creation, backed out on rollback — otherwise it wastes time putting it into log.

Data Reliability:
Physical structure protected by “careful write” — db on disk is ALWAYS valid and consistent; a page is written BEFORE the pointer to it is made. So worse comes to worse, you have an orphan page and NOT a null pointer. Orphaned pages will be found by
Atomicity protected by serial log — a transaction is committed when commit record hits the oxide.
The serial log is a “do log, for post commit data migration; “redo” log for post-crash recovery of data; “undo” log for post-crash resource recovery.

Jim’s Secret Agenda: not so secret anymore!
Replace varchar with string (varchar is an ABOMINATION left over from punch cards)
Replace tiny, small, medium, big ints with “number” (set limits if you want…)
Adopt a security model that’s useful for app servers (app server logs on to db server and THEN sets up security, but by then it’s too late. 3rd party client security control should have been put there 15 years ago)
Introduce useable row level security (filter sets, so querying does not accidentally give out the wrong info to the wrong people)
Teach the database world the merits of free context search, that everyone else already knows. (do you type a SELECT statement into Google?)

“The difference between theory and practice: in theory, there isn’t any difference.”

Comments are closed.