A Serious Z-Wave Security Discussion in [Market-Ticker]

2018-05-29 21:20 by Karl Denninger
in Technology, 427 references


So this blew up on Twatter today, after the author of an article I went after on his blog figured out who I was.

Here's the article that I went ape-shit over.

TL;DR: Stronger S2 Z-Wave pairing security process can be downgraded to weak S0, exposing smart devices to compromise.
Z-Wave uses a shared network key to secure traffic. This key is exchanged between the controller and the client devices (‘nodes’) when the devices are paired. The keys are used to protect the communications and prevent attackers exploiting joined devices.
The earlier pairing process (‘S0’) had a vulnerability – the network key was transmitted between the nodes using a key of all zeroes, and could be sniffed by an attacker within RF range. This issue was documented by Sensepost in 2013. We have shown that the improved, more secure pairing process (‘S2’) can be downgraded back to S0, negating all improvements.
Once you’ve got the network key, you have access to control the Z-Wave devices on the network. 2,400 vendors and over 100 million Z-wave chips are out there in smart devices, from door locks to lighting to heating to home alarms. The range is usually better than Bluetooth too: over 100 metres.

Ok, so the claims are basically:

1. S2 is better than S0 (true; it's faster mostly.) S2 also allows for user-initiated keying exchange with a shared secret of sorts (e.g. a pin code, etc.)

2. The latter is important because during the setup of an encrypted device you have to get the key into the device somehow. Of course if that key is shared and not hashed with each use by something unique to each endpoint then if you get the key you have it for every secure thing that speaks with the other end!

Oh, and "100 million devices and 2,400 vendors!" My God, it's full of stars!

Except..... 90+% of those devices do not support encryption at all. Your common light controls, thermostats, PIRs (motion detectors), etc -- nearly all of them run without any encryption. They don't get turned on by the neighbors only because their network ID is different, but that's not actual security. Newer devices support decent (CRC16) integrity checking, but older ones don't. Don't write that older stuff off though -- despite some misbehavior the old Intermatic CA9000 PIRs are arguably the most-rugged on offer and one of the best options if you don't need pet-proofing, the older Leviton Vizia series of switches and controllers are extraordinarily reliable, etc. Encryption support is not "free"; it requires "nonces" to be sent around, which consume network traffic, and of course there's a CPU requirement to encrypt and decrypt. All this means response time is impacted. You choose the trade-off. And be careful how you choose it -- for example, if you have a motion detector outside and it's running encrypted the key is in the unit and I could just steal it and then extract the key from the NVRAM at my leisure! Theoretically the SOCs in these units should prevent that. Theoretically.

Typically you find encrypted-mode support where it matters, which is in devices like locks. For obvious reasons anything that operates as a lock (e.g. a garage door opening device) without encryption is no lock at all. In those devices the part containing the key is in the protected space (in other words to steal it you must first break in, at which point the discussion is academic.)

One of the reasons "universal" S0 is not supported is that it is fairly "heavy" in terms of network and processor (battery, etc) load. S2 does address this to a material degree so when it is available on a "nearly-universal" basis for devices it'll be a "good thing." But that day is not today, and probably won't be tomorrow either. In fact I'll bet less than 1% of all Z-wave units in use today support any encryption whatsoever.

Now let's talk about how S0, which is the "default" secure implementation (the one that's actually in units today), works for a new device.

When you pair a new device the exchange goes like this:

The controller {C} is put into pairing mode (MANUALLY!)

The device {D} is poked (when "clean" pretty-much anything pokes it -- a button press, etc.)

{D} Hi, I am a device of type X and I like you! --> {C}

{D} <<< Ok, tell me what you are, here's your node number and network id {C}

{D} Thanks, I'm a device type Z here's a few things you should know --> {C}

Now the node talks a bit, along with the controller, and figures out what's in range so it knows how to build its idea of the mesh of the nodes around it. It can do this later too, but you REALLY want this to be right or the performance of the network goes to hell FAST.

Then the controller looks up that "few things you should know" set of data (is the unit always listening, what bitrate does it run at, etc.) and looks for a flag that says "I know how to talk encrypted." If it finds that specific flag set to "on", this happens during a very short window of time (100ms or so):

{D} -----> I want scheme X ----------> {C}

{D} <<< Ok, that's cool, give me a Nonce so I can send encrypted {C}

{D} ----> Here it is ------> {C}

{D} << Here's your network key {C}

Now at this point, if we're doing "S0", the potential issue arises. The node has provided a "nonce", which permutes the encryption so you can't repeat a packet and have it work twice. (Number once is what "nonce" stands for.) But the node doesn't have a key yet. So there's a hard-coded "pairing" set of data which the article's authors say has a "zero key" -- accurate as far as it goes, but not really because there are three components to the IV (what you initialize the encryption algorithm with) and only one of them is zeros. Not that it matters in practice, because they all have to be hardcoded, so figuring out what they are is a matter of disassembling someone's controller code or any device's microcode. But it is absolutely inaccurate to say that the encryption is initialized with all zeros -- it most-certainly is not!

(Remember, the device has to have the same hardcoded initialization value set in it... it does, so it can decrypt that packet and does so, then immediately replaces the working key with the network key it just received)

{D} Here's a reply confirming the key set operation --------> {C}

The problem is that during that little bit of time one specific packet, the "Here's your network key" frame the controller sends above, can be picked off, and since the keying is known you can theoretically steal the key.

If you do then you can proactively read (or send) traffic because the nonces are sent across and thus you have access to them.

In short, if you can steal the key, well, you stole the key!
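The exposure described above can be sketched in a few lines. This is a toy model only: the cipher is a SHA-256 keystream standing in for the real algorithm, `HARDCODED_PAIRING_KEY` is a placeholder rather than the actual pairing material, and every function name is hypothetical. The only point it makes is that a listener who knows the fixed pairing material and hears the nonce can decrypt the key frame.

```python
import hashlib
import os

# ASSUMPTION: placeholder for the fixed pairing material baked into firmware.
# The real values are not reproduced here.
HARDCODED_PAIRING_KEY = b"placeholder-pairing-key"


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode); NOT the real Z-Wave cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def xor_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (same operation) by XOR with the keystream."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))


def pair_s0(network_key: bytes):
    """Return what an RF sniffer would see: the device nonce and the key frame."""
    device_nonce = os.urandom(8)  # the device sends this in the clear
    key_frame = xor_crypt(HARDCODED_PAIRING_KEY, device_nonce, network_key)
    return device_nonce, key_frame


def eavesdrop(device_nonce: bytes, key_frame: bytes) -> bytes:
    """A listener who knows the hardcoded material decrypts the key frame."""
    return xor_crypt(HARDCODED_PAIRING_KEY, device_nonce, key_frame)


network_key = os.urandom(16)
nonce, frame = pair_s0(network_key)
print("key recovered by sniffer:", eavesdrop(nonce, frame) == network_key)
```

Note what the sketch does not model: the listener still has to be in RF range of the controller at the exact moment the owner is pairing.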

The gist of the article from PenTest deals with the S2 scheme, which is more-secure because the user can be prompted to seed the exchange from the console (e.g. "punch in a 5 digit code on the lock, and the same code on the controller") and this makes it a lot harder to rip off the keying. Further, S2 uses a formal key exchange mechanism so stealing the key isn't a bit harder, it's a lot harder, provided there's a shared secret. In fact, it's basically impossible. This is great.

So where's the problem?

Right now there are no commercial controllers that do S2, largely because there are damn near no devices that do S2! I haven't implemented it yet on HomeDaemon simply because I don't have anything that can run it here, and while I could certainly implement against a single device I'd like to have a few of them to make sure the code is actually stable and I'm not implementing and testing against a buggy implementation that some random manufacturer put out.

Which is quite possible, by the way -- don't get me started on that or I'll talk for hours.....

The "attack" put forward by the original article is an intentional downgrade attack. In other words by jamming the device's communications or otherwise tampering with them (remember, this is RF, so you CAN tamper with the transmission by jamming or other means) you can damage the reply packet from the node that says it can do S2.

This will cause the controller, which now thinks the node cannot do S2, to fall back and request S0 instead, which these devices also support.

Now, during that immediately-forthcoming forced S0 exchange you steal the key.

Note that this is exactly the same risk that exists for any S0 device -- originally, now and forevermore. It is not unique to the newly-minted S2-capable units. In fact for an S0 unit there's no need to jam anything.
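As a rough sketch of that fallback logic, assuming a controller that simply tries the strongest scheme it believes the node offers: the names `Scheme` and `negotiate` and the two boolean knobs are hypothetical, and no real controller firmware is being quoted.

```python
from enum import Enum


class Scheme(Enum):
    S2 = "S2"
    S0 = "S0"
    INSECURE = "insecure"


def negotiate(node_supports, s2_reply_received=True, s0_window_met=True):
    """Return (scheme, warnings) for one inclusion attempt.

    node_supports:     set of Scheme values the node really implements.
    s2_reply_received: False models a jammed or corrupted S2 capability reply.
    s0_window_met:     False models missing the short S0 scheme window.
    """
    warnings = []
    if Scheme.S2 in node_supports and s2_reply_received:
        return Scheme.S2, warnings
    if Scheme.S0 in node_supports and s0_window_met:
        # The controller cannot tell "node is S0-only" from "S2 reply was lost",
        # so the user should be told either way.
        warnings.append("included at S0, not S2")
        return Scheme.S0, warnings
    warnings.append("included without security; no network key was sent")
    return Scheme.INSECURE, warnings


both = {Scheme.S2, Scheme.S0}
print(negotiate(both))                           # clean S2 inclusion
print(negotiate(both, s2_reply_received=False))  # jammed: falls back to S0
print(negotiate({Scheme.S0}))                    # plain S0 unit: same end state
```

The last two calls end up in the same place, which is the point of the paragraph above: the jamming only gets the attacker to where every S0 unit already is.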

That sounds ugly, and it would be except for some realities that get in the way of it being ugly in actual practice.

First, the window of exposure is very small and cannot be triggered from the outside. The controller has to be told to pair, the node has to be told to pair, and then you have to be able to both jam and intercept during a very specific and small window of time. The frame time for the scheme reply to be valid is about 100ms and if you're off then the node comes up unsecured entirely. And... you don't get the key because the controller never sends it as there's no agreement on the scheme to be used. Oh by the way if the node is a lock (I have one sitting here on my desk bolted to two small pieces of wood that I use for testing) and it includes insecure then the "lock" functions are missing. Good, because that will force you to exclude and re-include it so you actually get secure mode (you do want a lock to be able to be locked, right?)

Second, and probably the best defense: best practice is to pair in low-power mode. In other words you remove the RF stick from the controller, physically take it to the new device and push a button on the stick to initiate the operation -- you do not do it from the command console. In that case the range of the pairing transmission from the controller, which is the only one you care about (since it contains the key), is inches, not 100' or more.

Now for convenience newer nodes (and controllers) can initiate pairing from a distance. In fact all the 500-series chipset stuff supports doing so. However, there's a lot of older gear out there that works perfectly well but can't handle high-power include, including roughly half or more of the devices in my house. In fact the standard for setting up Z-Wave devices and best practice has always been to do so in the final installed location and to pair with the controller at the device. This is easy when the master is a handheld controller roughly the size of a small remote control, as was the case with the original Leviton master controllers years ago. Of course this is sort of hard to do when your controller is this thing that has a wall-wart, doesn't have a separate RF interface and it's running your user interface at the same time! In other words convenience and poor design of some controllers (essentially all of the mass-market stuff, I might add) means you get to bring the device to the controller, pair it there, and then deal with an inevitable network reorganization to get good performance.

That, by the way, is another sore spot in that many controllers try to do it on their own which is flat-out stupid. The scope of why is beyond this article (although it's covered in some depth in HomeDaemon's user manual as a caution to people who would try to use those commands without understanding them) but it has to do with the fact that most battery-powered devices are not listening all the time and in order to get a good network map every device has to be on and able to receive and transmit. Good luck with that on an automated basis where you have anything that runs on batteries in the network. And if you think this is a "theoretical" pain in the ass there are 53 active Z-wave units in my house right now. There's nothing theoretical about running around removing the covers, sending configuration commands and similar on over a dozen battery-powered devices so they're all "awake" and can properly participate in a network rediscovery process! "Best practice" exists for a reason and especially with complex installations it's important to follow it for reasons other than security -- that's a nice side effect.

So the long and short of it is that these guys consider this a protocol problem and severe vulnerability.

I called bullshit on that and they didn't like it.

Here are my reasons; you decide who's right (with their full source article up above):

1. You cannot initiate pairing from a remote, nor over RF. You have to put the controller in that mode deliberately, and if it is not then it will ignore a unit that tries to perform pairing, never responding to the request at all. Since it never responds it cannot divulge a key. Therefore you need a deliberate act by the owner or system installer to first open the potential vulnerability in the first place.

2. You must then initiate pairing on the new unit. Now this is where things could get sort of ugly; a malefactor could put a "bare" (uninitialized) unit outside your house but within range and pair that. Then again if they can manage that they can steal the key no matter what because they then take the physical unit. If you can do that you can build a confederate unit that is designed to get keys and then display them for you. Bingo -- Bob's Your Uncle. Note that the only defense against that if you pair in high power is #1 because the unit cannot initiate pairing on its own.

3. Best practice is to perform pairing with the unit in the installed final location using a controller that is operating in low-power mode to pair. This reduces the potential interception range to inches from ~100' or so whether the intruder is using custom-designed equipment or a simple sniffer. If a confederate can get a listening device within a foot of you when you are doing this then he can also put a fake node in the same place, trigger it to include whenever it sees some other node trying to do so, and steal the key the hard way -- by retrieving the device later and extracting it from the device's NVRAM. S2 mitigates this if there's a user-controlled PIN or similar used, obviously, since the confederate would not know what it is nor have a way to enter it and he needs that for the initial keying exchange to be decodable. Note that it does not matter whether the node runs in low-power during inclusion or not, since the node doesn't send the key -- the controller does, and by the time the node sends an encrypted message it has the actual network key in it and the risk window is closed. If you have a controller that is fixed-location and doesn't have a removable stick that's an implementation problem and stupid design of the controller, not the protocol, but even that can be overcome -- keep reading.

4. Once keyed it doesn't matter. In other words the risk is only in forcing a fallback from S2 -> S0. Further, the standards say that if you do that you're supposed to warn the user. In fact the fallback chain is S2 -> S0 -> Insecure, and that happens sometimes when including S0 and you get to start over because RF noise or similar corrupts one of the packets; they pass checksum (1 byte and thus not much for integrity; a 1-in-256 chance the packet is smashed but the checksum is good) but fail MAC validation (VERY solid on integrity) and the other end cannot possibly discern what was being said, since the packet doesn't decrypt. Indeed if the MAC fails you don't even know what the transmission was and the underlying protocol does not have a "repeat last message please" request either. This happens fairly regularly by the way in ordinary operation; I get MAC/NONCE errors (one of them is bad and thus the decrypt fails, but which? No way to know) quite regularly on one of my units that's installed in a metal box and thus the RF is sort of nasty-attenuated. It still works fine and I leave it that way intentionally as a code-robustness test but a fairly decent number of decrypt failures get logged. And yes, HomeDaemon-MCP does tell you explicitly whether a node is included secure or not, both initially and permanently on the main node display in that secure units have a "*" after their name.

Incidentally this is a severe weakness in the Z-Wave spec but it's not a security concern, it's an operational one. If you send a packet and it passes checksum the underlying RF protocol considers it perfectly fine and the originating device (which you don't control if it's not your controller where you wrote the code!) can and will overwrite its buffer; that is, it considers the message "delivered" and all is good. Well, it might not be. If the CRC16 or MAC computation fails you know you have trash instead of a valid packet but no way to ask for a retransmit. There's a fair bit of code in HomeDaemon that does its level best to prevent that from being operationally significant but sadly if the original event was an asynchronous report (that is, you didn't solicit it so you don't know what it was supposed to be) there's really not much you can do other than log the fact that you got something you can't successfully (or safely) process.
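To put a number on how little a 1-byte checksum buys you, here is a small simulation. It assumes the checksum is an XOR of the frame bytes seeded with 0xFF, and it models heavy corruption as uniformly random bytes; both are simplifying assumptions, and `xor_checksum` and `false_accept_rate` are names made up for this sketch.

```python
import random


def xor_checksum(data: bytes) -> int:
    """8-bit checksum: XOR of all bytes, seeded with 0xFF (assumed form)."""
    c = 0xFF
    for b in data:
        c ^= b
    return c


def false_accept_rate(trials: int = 50_000, length: int = 12, seed: int = 1) -> float:
    """Fraction of random garbage frames whose checksum matches the original's."""
    rng = random.Random(seed)
    original = bytes(rng.getrandbits(8) for _ in range(length))
    good = xor_checksum(original)
    accepted = 0
    for _ in range(trials):
        garbage = bytes(rng.getrandbits(8) for _ in range(length))
        if garbage != original and xor_checksum(garbage) == good:
            accepted += 1
    return accepted / trials


rate = false_accept_rate()
print(f"garbage accepted by 1-byte checksum: {rate:.5f} (1/256 = {1 / 256:.5f})")
print(f"same model with a 16-bit check:      about {1 / 65536:.7f}")
```

A single flipped bit is always caught by an XOR checksum, so this is the smashed-packet case only. That is exactly the case where the MAC catches what the checksum lets through.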

So what would have been a more responsible way to look at this issue from a standpoint of what you could reasonably ask the Z-wave people to do as a means of immediately mitigating this (modest, but real) risk?

There's a very simple mitigation that could be made without breaking backward compatibility in any way: Change the specification for the controller code so that any transmission of Class 98 (Security), Subclass 06 (Keying) always goes out at low power.

That's the end of the problem in real terms; now the interception range of that frame, which is the specific one at risk, is measured in inches instead of tens of feet or even 100'. And, pleasantly, that's a fix manufacturers can ship as a controller firmware update.
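In code terms the proposed rule is tiny. The sketch below assumes the check is made on the outgoing command before any security encapsulation, and that the first two payload bytes are the command class and the command; the function name and the power labels are hypothetical.

```python
COMMAND_CLASS_SECURITY = 0x98  # "Class 98 (Security)" in the text
NETWORK_KEY_SET = 0x06         # "Subclass 06 (Keying)" in the text


def tx_power_for(payload: bytes, requested: str = "normal") -> str:
    """Pick the transmit power for an outgoing command payload.

    ASSUMPTION: payload[0] is the command class and payload[1] the command.
    """
    is_key_frame = (
        len(payload) >= 2
        and payload[0] == COMMAND_CLASS_SECURITY
        and payload[1] == NETWORK_KEY_SET
    )
    # The keying frame always goes out at low power, whatever the caller asked for.
    return "low" if is_key_frame else requested


print(tx_power_for(bytes([0x98, 0x06]) + bytes(16), requested="high"))  # low
print(tx_power_for(bytes([0x25, 0x01, 0xFF]), requested="high"))        # high
```

Nothing else about the exchange changes, which is why this would not break backward compatibility.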

What did I do?

Well, I can't get into the firmware of the RF chip that is used in the Aeotec stick, so I can't fix it the way I'd like, although I can certainly recommend that Aeotec do so, and will.

But what I can do and did is set HomeDaemon's code to explicitly not ask for either high-power or network include unless you tell it to include non-secure only.

HomeDaemon has always had two commands to add nodes: "add-node" and "add-node-nosecure"; the latter intentionally ignores a security scheme request. In both cases the "add node" command stanza to the controller includes options that (1) constrain the type of node that it will accept, (2) select NWI (network-wide inclusion) rather than direct-only inclusion and (3) select high-power transmission. So the simple delta to the code was to remove the NWI and high-power flags from the "add node" request unless you've blocked secure inclusion for that device.

In other words if you force insecure include mode, that is, you won't answer a request for keying even if the node sends one, then there's no harm in a high-power inclusion since there's no keying sent. But if you do a "regular" include which allows for secure mode negotiation then high power mode is not requested, nor is network-include.
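A minimal sketch of that rule, not HomeDaemon's actual code: the flag values below are assumptions modeled on commonly published Z-Wave serial API constants, and `add_node_mode` is a hypothetical helper.

```python
# ASSUMPTION: values as commonly published for the serial API "add node" request.
ADD_NODE_ANY = 0x01
OPTION_NETWORK_WIDE = 0x40  # NWI: allow inclusion through other nodes
OPTION_HIGH_POWER = 0x80    # include at normal (full) transmit power


def add_node_mode(allow_secure: bool, node_type: int = ADD_NODE_ANY) -> int:
    """Build the mode byte for an "add node" request.

    If a key could be sent (allow_secure), ask for neither high power nor NWI.
    If secure inclusion is blocked, no key is ever sent, so both are harmless.
    """
    mode = node_type
    if not allow_secure:
        mode |= OPTION_HIGH_POWER | OPTION_NETWORK_WIDE
    return mode


print(f"add-node          -> 0x{add_node_mode(True):02X}")
print(f"add-node-nosecure -> 0x{add_node_mode(False):02X}")
```

With these assumed values the two commands produce 0x01 and 0xC1 respectively.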

Never mind that you shouldn't do the potentially-risky thing in the first place; the best and proper way to include a node is to pull the stick out, take it to the node and use the button on the stick to include the new device. That's how I do all my node includes in testing and recommend it generally. The range of that operation is inches by my direct experience just like it was with my old Leviton handheld master (which I still have, by the way, although I don't use it anymore) and again, if someone can get a device that can pick off the signal from that close when you're doing it they can just put a fake node out there, pair that, and then retrieve it and steal the key directly.

Incidentally S2 is not a panacea. In order for it to work the seed for the ephemeral key has to be known somehow. DHE (elliptic curve) is fabulous and the curve they chose (25519) is the same one that is used by modern ssh2 clients and servers; it has the attractive property of having reasonably-short keying so you can rationally print or barcode it on a unit somewhere. Since a unit (wall switch, etc) has a very limited amount of storage and CPU power that choice was also driven in part, I'm sure, by those constraints. Of course if the pairing code is printed on a wall switch and you need to re-pair it you might find it at least moderately annoying to have to remove the switch from the wall to get the pairing code again, and God Help You if the barcode or printing is damaged since there's absolutely no way to guess it. S2 has a second attractive property in that there are multiple "levels" of trust (three), and it does forward nonce computation (assuming there aren't RF errors, which can happen and force a nonce resync), which reduces both traffic and power consumption. Those are both good and that's enough reason to support it standing alone once there is a decent selection of S2-capable units available for purchase.
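For a sense of what "short enough to print" looks like, here is a sketch of turning 16 bytes of device-specific keying into a printable code. The layout is an assumption about the S2 convention (eight groups of five decimal digits, each group one 16-bit value, with the first group doubling as the 5-digit PIN), and `format_dsk` and `pin_of` are hypothetical names.

```python
def format_dsk(dsk: bytes) -> str:
    """Render 16 bytes as eight dash-separated 5-digit decimal groups.

    ASSUMPTION: each group is one big-endian 16-bit value (00000..65535).
    """
    if len(dsk) != 16:
        raise ValueError("expected exactly 16 bytes")
    groups = [int.from_bytes(dsk[i:i + 2], "big") for i in range(0, 16, 2)]
    return "-".join(f"{g:05d}" for g in groups)


def pin_of(dsk: bytes) -> str:
    """The 5-digit code a user would punch in (assumed: the first group)."""
    return format_dsk(dsk).split("-")[0]


example = bytes(range(16))  # made-up bytes, not a real device key
print(format_dsk(example))
print("PIN:", pin_of(example))
```

Forty digits is fine for a sticker or a barcode, and it is also why a damaged label is a real problem: there is nothing to guess from.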

So that's my read on this alleged "terrible" vulnerability.

I'll support S2 if and when I can get my hands on some devices to proof the code against. It shouldn't be that hard to add; the methods are decently documented.

But until then, and especially if you have devices that use S0 (the "usual" Z-Wave security) now, take a chill pill instead of the clickbait. Understand the threat and attack surface -- it's small but real, and you have to do dumb things in order for it to be a problem. I don't consider this a protocol fault by any means; S2 is faster and, while the keying is better, you typically key a device exactly once in a given installation, so the actual attack surface is likewise very small.

In short be aware of best practices, follow them, and if very odd, unexplainable things start happening ask questions before you just start wildly doing something that some malefactor may want you to do.

You wouldn't answer the phone and give the caller your social security number or garage code; treat this the same way.
