TLS handshake failure on Gen5 HughesNet transport

Hello,

My company has built an IoT water treatment product that is currently in soft release (https://dropconnect.com/). The core of the system is a WiFi-enabled hub based on Microchip’s PIC32 microcontroller, and we use Mbed TLS in both server mode (iOS and Android apps communicating with the hub) and client mode (the hub communicating with AWS IoT Core). In general, Mbed TLS is working beautifully. However, the TLS handshake is failing in one specific use case. I don’t believe it’s actually a problem with Mbed TLS; I’m mostly looking for advice on the best approach to diagnosing the problem.

The problem occurs when the hub, acting as the TLS client, authenticates with AWS IoT over a HughesNet satellite connection where the modem is their Gen5 HT2000W model. The only HughesNet satellite connection I have personally been able to test with uses their Gen4 modem, and the TLS handshake succeeds without error over that link. The customers with Gen5 HughesNet systems are, unfortunately, about 800 miles away, so I’ve only been able to gather limited diagnostic data through custom firmware updates designed to report progress through the TLS handshake.

The TLS handshake progresses normally up to the MBEDTLS_SSL_SERVER_CHANGE_CIPHER_SPEC step. Prior to starting that step, I’ve confirmed that data is being transmitted and received normally and that the same cipher suite (MBEDTLS_CIPHER_AES_256_GCM) is negotiated as on a successful connection. Once the handshake reaches that step, mbedtls_ssl_parse_change_cipher_spec() always returns MBEDTLS_ERR_SSL_WANT_READ, indicating that the expected response from the AWS IoT Core server is either never sent or is being dropped somewhere on the HughesNet link.
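
For context, the firmware’s progress reporting works roughly like the sketch below: the handshake is driven one state at a time and the current state is written to the diagnostic log before each step. This is a simplified illustration, not our production code; diag_log() is a stand-in for our application-layer logger.

#include "mbedtls/ssl.h"

/* Hypothetical helper standing in for our flash-backed diagnostic log. */
extern void diag_log(const char *fmt, ...);

/* Sketch: step the handshake manually so the current state can be recorded.
 * A real implementation would also enforce an overall timeout. */
static int handshake_with_progress(mbedtls_ssl_context *ssl)
{
    int ret = 0;

    while (ssl->state != MBEDTLS_SSL_HANDSHAKE_OVER) {
        diag_log("IOT TLS handshake step %d", ssl->state);

        ret = mbedtls_ssl_handshake_step(ssl);

        if (ret == MBEDTLS_ERR_SSL_WANT_READ || ret == MBEDTLS_ERR_SSL_WANT_WRITE)
            continue;       /* no data yet: retry the same step */

        if (ret != 0)
            break;          /* hard error: ssl->state tells us where we stopped */
    }

    return ret;
}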

We had previously been using Mbed TLS 2.4.0 from 2016 but recently upgraded to 2.16.1. Despite the many obvious improvements made to Mbed TLS between those versions, the behavior over the Gen5 HughesNet link while using 2.16.1 is exactly the same as before. The average latency over a satellite link is about 640ms, but the network conditions don’t appear to be a problem. I’ve simulated network conditions with 2,000ms latencies and terrible bandwidth caps and the TLS handshake still works properly, although much more slowly.

Can anyone suggest a diagnostic approach or specific state information to verify at the point where the communication breaks down?

Thanks,

Patrick Frazer
Chandler Systems, Inc.

Hi @pfrazer
As you can see from here, the error can also be returned when you receive a non-fatal, non-closure alert from the remote peer.
Have you enabled debug logs?
Regards,
Mbed TLS team member
Ron

I haven’t been able to use the debug logs yet to diagnose the problem because the only running systems that exhibit the problem are installed at remote locations. I’ve been relying on my application-layer diagnostic log stored in flash memory on the device to trace the progress through the handshake. We plan to have the same residential satellite equipment installed at our office soon, so I expect that I’ll be able to get more granular logging.

If mbedtls_ssl_read() sees a message of type MBEDTLS_SSL_MSG_ALERT and returns WANT_READ, do I need to do anything to pop that message off the stack? My wrapper around Mbed TLS calls mbedtls_ssl_handshake() repeatedly until it either returns zero, returns an error other than WANT_READ or WANT_WRITE, or a very long timeout expires. I assumed that the message type you referenced (if one was actually received) would pass a WANT_READ result all the way back up to where mbedtls_ssl_handshake() was called, at which point my wrapper would call it again to process the next message. Currently, my wrapper times out after mbedtls_ssl_handshake() has returned WANT_READ for more than a minute.
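
To make the wrapper behavior concrete, it looks roughly like the sketch below (simplified; elapsed_ms() is a placeholder for our firmware’s tick counter, and error reporting is trimmed):

#include <stdint.h>
#include "mbedtls/ssl.h"

/* Placeholder for the firmware's millisecond tick source. */
extern uint32_t elapsed_ms(void);

static int handshake_with_timeout(mbedtls_ssl_context *ssl, uint32_t limit_ms)
{
    uint32_t start = elapsed_ms();
    int ret;

    for (;;) {
        ret = mbedtls_ssl_handshake(ssl);

        if (ret == 0)
            return 0;                          /* handshake complete */

        if (ret != MBEDTLS_ERR_SSL_WANT_READ &&
            ret != MBEDTLS_ERR_SSL_WANT_WRITE)
            return ret;                        /* hard failure */

        if (elapsed_ms() - start > limit_ms)
            return MBEDTLS_ERR_SSL_TIMEOUT;    /* give up, e.g. after 60000 ms */

        /* WANT_READ / WANT_WRITE: simply call mbedtls_ssl_handshake() again */
    }
}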

Thanks for the suggestions. I’ll try to follow up with more detail after we get the satellite equipment installed locally.

Patrick

Hi,

Since my last post, we’ve installed a HughesNet satellite dish at our office so that I can reproduce the problem, and I’ve enabled detailed logging of the TLS handshake process. I’ve compared the output for a failed handshake over the satellite connection and a successful handshake over a landline connection, and found that the handshakes are in lockstep right up to step 12 (MBEDTLS_SSL_SERVER_CHANGE_CIPHER_SPEC). As previously suspected, that step stalls because it never receives the expected response from the remote server.
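
For reference, the detailed output below comes from Mbed TLS’s own debug hook routed to our serial console, wired up roughly as follows (MBEDTLS_DEBUG_C must be enabled in the build; dbg_printf() is a stand-in for our console output routine):

#include "mbedtls/debug.h"
#include "mbedtls/ssl.h"

/* Stand-in for the firmware's console/serial print routine. */
extern void dbg_printf(const char *fmt, ...);

static void my_debug(void *ctx, int level, const char *file, int line,
                     const char *str)
{
    ((void) ctx);
    ((void) level);
    dbg_printf("%s:%d: %s", file, line, str);   /* str already ends in '\n' */
}

static void enable_tls_debug(mbedtls_ssl_config *conf)
{
    mbedtls_ssl_conf_dbg(conf, my_debug, NULL);
    mbedtls_debug_set_threshold(3);   /* 0 = off ... 4 = verbose */
}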

In the following log fragments, the lines that start with a timestamp are my own application-layer diagnostic messages and lines starting with ‘//’ are my own comments. Here is a fragment of the log output from a successful connection:

// Step 11 beginning...
06/04 07:51:41.888 I SysLog     [DIAG] IOT TLS handshake step (2830187) Data: 11/0
06/04 07:51:41.888 I LogNVM     Write: ChipSelect: 01, VAddr: 257C, PAddr: 257C, Count:6
ssl_cli.c:3510: client state: 11
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:6419: => write finished
ssl_tls.c:6293: => calc  finished tls sha384
ssl_tls.c:6317: dumping 'calc finished result' (12 bytes)
ssl_tls.c:6317: 0000:  51 70 e3 62 d9 72 48 ec 2e f5 c2 00              Qp.b.rH.....
ssl_tls.c:6323: <= calc  finished
ssl_tls.c:6464: switching to new transform spec for outbound data
06/04 07:51:42.406 I SysLog     [DIAG] IOT cipher (2830239) Data: 16/1
06/04 07:51:42.406 I LogNVM     Write: ChipSelect: 01, VAddr: 2581, PAddr: 2581, Count:7
ssl_tls.c:3189: => write handshake message
ssl_tls.c:3348: => write record
ssl_tls.c:1447: => encrypt buf
ssl_tls.c:1620: before encrypt: msglen = 24, including 0 bytes of padding
ssl_tls.c:1783: <= encrypt buf
ssl_tls.c:3428: output record: msgtype = 22, version = [3:3], msglen = 40
ssl_tls.c:2760: => flush output
ssl_tls.c:2779: message length: 45, out_left: 45
ssl_tls.c:2784: ssl->f_send() returned 45 (-0xffffffd3)
ssl_tls.c:2812: <= flush output
ssl_tls.c:3481: <= write record
ssl_tls.c:3325: <= write handshake message
ssl_tls.c:6534: <= write finished
06/04 07:51:42.414 I SysLog     [DIAG] IOT TLS handshake step (2830240) Data: 12/0
06/04 07:51:42.415 I LogNVM     Write: ChipSelect: 01, VAddr: 2587, PAddr: 2587, Count:6
ssl_cli.c:3510: client state: 12
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:5907: => parse change cipher spec
ssl_tls.c:4316: => read record
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 0, nb_want: 5
ssl_tls.c:2726: in_left: 0, nb_want: 5
ssl_tls.c:2727: ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
ssl_tls.c:2747: <= fetch input
ssl_tls.c:4061: input record: msgtype = 20, version = [3:3], msglen = 1
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 5, nb_want: 6
ssl_tls.c:2726: in_left: 5, nb_want: 6
ssl_tls.c:2727: ssl->f_recv(_timeout)() returned 1 (-0xffffffff)
ssl_tls.c:2747: <= fetch input
ssl_tls.c:4390: <= read record
ssl_tls.c:5940: switching to new transform spec for inbound data
06/04 07:51:42.922 I SysLog     [DIAG] IOT cipher (2830291) Data: 16/2
06/04 07:51:42.922 I LogNVM     Write: ChipSelect: 01, VAddr: 258C, PAddr: 258C, Count:7
ssl_tls.c:5987: <= parse change cipher spec
06/04 07:51:42.928 I SysLog     [DIAG] IOT TLS handshake step (2830291) Data: 13/0
06/04 07:51:42.928 I LogNVM     Write: ChipSelect: 01, VAddr: 2592, PAddr: 2592, Count:6
ssl_cli.c:3510: client state: 13
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:6551: => parse finished
ssl_tls.c:6293: => calc  finished tls sha384
ssl_tls.c:6317: dumping 'calc finished result' (12 bytes)
ssl_tls.c:6317: 0000:  8f f9 3e 9a 82 cb 1a a7 86 ff af a2              ..>.........
ssl_tls.c:6323: <= calc  finished
ssl_tls.c:4316: => read record
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 0, nb_want: 5
ssl_tls.c:2726: in_left: 0, nb_want: 5
ssl_tls.c:2727: ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
ssl_tls.c:2747: <= fetch input
ssl_tls.c:4061: input record: msgtype = 22, version = [3:3], msglen = 40
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 5, nb_want: 45
ssl_tls.c:2726: in_left: 5, nb_want: 45
ssl_tls.c:2727: ssl->f_recv(_timeout)() returned 40 (-0xffffffd8)
ssl_tls.c:2747: <= fetch input
ssl_tls.c:1796: => decrypt buf
ssl_tls.c:2378: <= decrypt buf
ssl_tls.c:3631: handshake message: msglen = 16, type = 20, hslen = 16
ssl_tls.c:4390: <= read record
ssl_tls.c:6619: <= parse finished
// Step 13 completed, handshake continues to completion...

Here is a log fragment from a failed handshake over the satellite connection:

// Step 11 beginning...  Up to this point the output exactly matches a successful connection
// except where expected due to randomized data.
06/04 07:54:18.897 I SysLog     [DIAG] IOT TLS handshake step (2845888) Data: 11/0
06/04 07:54:18.897 I LogNVM     Write: ChipSelect: 01, VAddr: 2631, PAddr: 2631, Count:6
ssl_cli.c:3510: client state: 11
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:6419: => write finished
ssl_tls.c:6293: => calc  finished tls sha384
ssl_tls.c:6317: dumping 'calc finished result' (12 bytes)
ssl_tls.c:6317: 0000:  4c c3 3c 85 22 c4 ca 7c 89 4c ab 97              L.<."..|.L..
ssl_tls.c:6323: <= calc  finished
ssl_tls.c:6464: switching to new transform spec for outbound data
06/04 07:54:19.412 I SysLog     [DIAG] IOT cipher (2845940) Data: 16/1
06/04 07:54:19.412 I LogNVM     Write: ChipSelect: 01, VAddr: 2636, PAddr: 2636, Count:7
ssl_tls.c:3189: => write handshake message
ssl_tls.c:3348: => write record
ssl_tls.c:1447: => encrypt buf
ssl_tls.c:1620: before encrypt: msglen = 24, including 0 bytes of padding
ssl_tls.c:1783: <= encrypt buf
ssl_tls.c:3428: output record: msgtype = 22, version = [3:3], msglen = 40
ssl_tls.c:2760: => flush output
ssl_tls.c:2779: message length: 45, out_left: 45
// The transmission in step 11 appears successful:
ssl_tls.c:2784: ssl->f_send() returned 45 (-0xffffffd3)
ssl_tls.c:2812: <= flush output
ssl_tls.c:3481: <= write record
ssl_tls.c:3325: <= write handshake message
ssl_tls.c:6534: <= write finished
06/04 07:54:19.420 I SysLog     [DIAG] IOT TLS handshake step (2845941) Data: 12/0
06/04 07:54:19.420 I LogNVM     Write: ChipSelect: 01, VAddr: 263C, PAddr: 263C, Count:6
ssl_cli.c:3510: client state: 12
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:5907: => parse change cipher spec
ssl_tls.c:4316: => read record
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 0, nb_want: 5
ssl_tls.c:2726: in_left: 0, nb_want: 5
// At this point, mbedtls_ssl_fetch_input() returns WANT_READ and the handshake step exits
// and retries.  We never see any data from the server again and the connection attempt
// eventually times out.
ssl_tls.c:8172: <= handshake
ssl_tls.c:8122: => handshake
ssl_cli.c:3510: client state: 12
ssl_tls.c:2760: => flush output
ssl_tls.c:2772: <= flush output
ssl_tls.c:5907: => parse change cipher spec
ssl_tls.c:4316: => read record
ssl_tls.c:2538: => fetch input
ssl_tls.c:2701: in_left: 0, nb_want: 5
ssl_tls.c:2726: in_left: 0, nb_want: 5
ssl_tls.c:8172: <= handshake
// The previous 10 lines between '=> handshake' and '<= handshake' repeat every 5 seconds (the
// defined timeout for a read) until my code aborts the handshake after a minute)

In both cases, my code is the client and AWS IoT Core is the server. I see three logical explanations for this behavior:

  1. The data transmitted in client step 11 is never received at AWS.
  2. AWS receives the client cipher spec change message in step 11 but never responds.
  3. AWS sends the data expected by my client in step 12 but it is never received by the client.

I view theory 2 as being very unlikely and assume that the satellite carrier is dropping packets for some reason. Can anyone offer suggestions on what I should be digging into next? How can I determine where the data loss is happening?

Thanks!

Patrick

Hi Patrick,
Do you have access to the server logs?
I doubt it…
Can you sniff the network, to see if the packet was sent/received?
I believe that the server received the client ChangeCipherSpec, but something was wrong in the message, such as a bad MAC, so it terminated the connection without sending a fatal alert.

Regards

Ron,

No, we don’t have access to any logs on Amazon’s end. I set up Wireshark to capture the network traffic, and I believe I’ve confirmed that no response is sent by AWS after the client ChangeCipherSpec message goes out.

Probably the most useful discovery came after I modified the ssl_client1.c project included in the Visual Studio samples in the Mbed TLS distribution to connect to our AWS endpoint and use the same X.509 certificates I’ve been testing with on our embedded hardware. That modified project is able to complete the TLS handshake both over our landline ISP and over the HughesNet Gen5 link. I don’t feel like I’m any closer to a solution, but it does seem to rule out Mbed TLS itself as the source of the problem. The implication is that the problem is either timing related (possibly due to delays added as our little PIC processor crunches numbers during the TLS handshake) or rooted in Microchip’s TCP/IP stack library.
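
For anyone curious, the changes to ssl_client1.c amounted to pointing it at our endpoint and loading the device certificate and key for mutual authentication, roughly as sketched below. The endpoint and file names here are placeholders, not our real values:

#include "mbedtls/ssl.h"
#include "mbedtls/x509_crt.h"
#include "mbedtls/pk.h"

/* Placeholders: substitute the real AWS IoT endpoint and credential files. */
#define AWS_ENDPOINT  "example-ats.iot.us-east-1.amazonaws.com"

static int load_aws_credentials(mbedtls_ssl_config *conf,
                                mbedtls_ssl_context *ssl,
                                mbedtls_x509_crt *cacert,
                                mbedtls_x509_crt *clicert,
                                mbedtls_pk_context *pkey)
{
    int ret;

    /* Root CA that signs the AWS IoT server certificate */
    if ((ret = mbedtls_x509_crt_parse_file(cacert, "AmazonRootCA1.pem")) != 0)
        return ret;

    /* Device certificate and private key for mutual authentication */
    if ((ret = mbedtls_x509_crt_parse_file(clicert, "device-cert.pem")) != 0)
        return ret;
    if ((ret = mbedtls_pk_parse_keyfile(pkey, "device-key.pem", NULL)) != 0)
        return ret;

    mbedtls_ssl_conf_ca_chain(conf, cacert, NULL);
    if ((ret = mbedtls_ssl_conf_own_cert(conf, clicert, pkey)) != 0)
        return ret;

    /* Hostname for SNI and server certificate verification */
    return mbedtls_ssl_set_hostname(ssl, AWS_ENDPOINT);
}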

I’ll keep working and will update this thread as I learn new things…

Patrick

Just adding an update on my situation…

I observed that the socket reads of the first response from the AWS server looked a bit odd: the ssl->f_recv() function was returning 1393 bytes, then 2 bytes, then the balance of the message. The default receive socket buffer as defined in Microchip’s TCP/IP library was 1024 bytes, and after increasing that buffer to 1460 bytes (the typical TCP maximum segment size: a 1500-byte Ethernet MTU less the 20-byte IP and TCP headers) the problem simply went away. I haven’t fully studied the cause yet, but it appears that the previous buffer size was corrupting the two bytes in the middle of that read operation. The AWS TLS implementation must have been silently giving up on the conversation because of that corruption.
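
In case it helps anyone hitting the same thing, the fix was essentially one value in the stack’s TCP socket configuration. The macro names below are illustrative only, since the exact names depend on the Microchip TCP/IP stack version in use:

/* Illustrative macro names only; the real ones depend on the Microchip
 * TCP/IP stack version. The point is that the per-socket RX buffer must
 * hold at least one full TCP segment. */

/* 1500-byte Ethernet MTU minus the 20-byte IP and 20-byte TCP headers */
#define TCP_MAX_SEGMENT_SIZE        (1500 - 20 - 20)   /* = 1460 */

/* Was 1024, which split the server's post-handshake flight across reads
 * and, apparently, corrupted two bytes in the middle. */
#define TCP_SOCKET_RX_BUFFER_SIZE   TCP_MAX_SEGMENT_SIZE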

Thanks to everyone who applied brain cycles thinking about this issue; I believe it’s been resolved.

Patrick

Hi Patrick,
I’m glad the issue is resolved.
I believe the issue was some sort of message fragmentation: the server sent a message larger than what was configured in your library.
You could look at the “Mbed TLS over low-bandwidth, unreliable datagram networks” article in the Mbed TLS documentation for additional information on DTLS-related issues (although you were working with TLS).
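If you ever need to tune the record buffer sizes on a constrained device, the relevant compile-time options look roughly like this (example values only, set in mbedtls/config.h or your MBEDTLS_CONFIG_FILE; the peer must not send records larger than the incoming buffer):

/* Example values only. MBEDTLS_SSL_MAX_CONTENT_LEN bounds the largest TLS
 * record payload the library will buffer; larger incoming records cannot
 * be processed. The default is 16384. */
#define MBEDTLS_SSL_MAX_CONTENT_LEN   4096

/* Since Mbed TLS 2.13 the incoming and outgoing buffers can be sized
 * independently with MBEDTLS_SSL_IN_CONTENT_LEN and
 * MBEDTLS_SSL_OUT_CONTENT_LEN. */

/* Optional: advertise a smaller maximum fragment length so a compliant
 * peer does not send records larger than the local buffer. */
#define MBEDTLS_SSL_MAX_FRAGMENT_LENGTH
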
Regards,
Ron