
mbedtls_ssl_handshake() segfault after ~1000 iterations

It looks like it is solved.

I now call mbedtls_net_free() for every mbedtls_net_context I create, right after declaring it. Correct?

This makes the 10000 iterations pass with the above example as well as with my production code.
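
For reference, the init/connect/free pairing as I understand it from the net_sockets API (a minimal sketch; host, port and the missing error handling are illustrative only):

#include "mbedtls/net_sockets.h"

/* Rough sketch of the per-iteration net-context lifecycle
 * (illustrative only; error handling omitted). */
static void one_iteration(void)
{
    mbedtls_net_context netCtx;

    mbedtls_net_init(&netCtx);                   /* marks the fd as unused (-1) */

    mbedtls_net_connect(&netCtx, "localhost", "4433",
                        MBEDTLS_NET_PROTO_TCP);  /* opens the socket */

    /* ... TLS handshake and I/O ... */

    mbedtls_net_free(&netCtx);                   /* closes the fd again */
}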

I am not yet sure about my server. I will try to produce an example if I can reproduce that one.

Nope, the server can still be crashed by loading a cert without a chain into the client. The backtrace looks very similar:

#3  0x00007fb325fe5870 in mbedtls_ssl_handshake () from /usr/local/lib/libmbedtls.so.12
#2  0x00007fb325fdffc8 in mbedtls_ssl_handshake_server_step () from /usr/local/lib/libmbedtls.so.12
#1  0x00007fb325fdcf85 in ssl_parse_client_hello () from /usr/local/lib/libmbedtls.so.12
#0  0x00007fb325fe8dee in mbedtls_ssl_fetch_input () from /usr/local/lib/libmbedtls.so.12

(I have no example code yet.)

In my thread test - that is, when I have 2 threads creating such EchoClient objects - it still crashes after 1000-1200 objects have been created. The number varies. Three runs gave me 1063, 1177, and 1099 objects created+destroyed before the segfault.

It's always in mbedtls_ssl_fetch_input().

(I need to adjust my example to show that here.)

Hi,
It sounds to me like a concurrency issue; do you protect your data with mutexes?

Yes, it's mutexed.

I am working on an example without our own library, to rule it out (it handles the threads and mutexes).

So far I have failed to reproduce my production crash with a simplified example.

I will go another route: looking into mbedtls itself using the debugger to see precisely where mbedtls_ssl_fetch_input() crashes and on which value(s), the FD in particular.

(I could not do this yet, because I am linked against a general-purpose build installed in the OS - it needs to be added to my app's build system first, which is Automake.)

Now I have an mbedTLS build made with this:

cmake3 -D CMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=${MBEDTLS_PATH}/dbg -DUSE_SHARED_MBEDTLS_LIBRARY=On -DENABLE_PROGRAMS=Off ../mbedtls_src

Interestingly, it does not segfault with that.

After 1019 iterations, however, my program hangs. The 2 threads stop finishing their tasks, so main() waits on the list of tasks.

When that happens I see in /proc/<pid>/fd/ that my file descriptors are exhausted.

These file descriptors are of type anon_inode:[eventpoll]. They are not the socket:[XYZ] ones, which mbedtls_net_free() closes, right?

So that closing part works, and I need to look into why epoll is not cleaning up.
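
For my own notes: the epoll instance is itself a file descriptor and has to be closed explicitly. A minimal sketch of what our event-loop wrapper should be doing (illustrative only, not our actual wrapper code):

#include <sys/epoll.h>
#include <unistd.h>

/* Illustrative sketch: an epoll instance is a file descriptor of its own
 * (it shows up as anon_inode:[eventpoll] in /proc/<pid>/fd) and leaks
 * unless it is closed explicitly. */
static int wait_readable(int sockFd, int timeoutMs)
{
    struct epoll_event ev;
    struct epoll_event out;
    int ep, n;

    ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    ev.events  = EPOLLIN;
    ev.data.fd = sockFd;
    epoll_ctl(ep, EPOLL_CTL_ADD, sockFd, &ev);

    n = epoll_wait(ep, &out, 1, timeoutMs);

    close(ep);   /* the cleanup that is apparently missing in my wrapper */
    return n;
}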

Hello Ron,

Can I have two SSL configurations in one process (plus two X.509 certs, two private keys, and two DRBGs with entropy for each)?

The reason I ask comes from looking at race conditions using the Helgrind tool. It reports things like this:

==2595== ----------------------------------------------------------------
==2595== 
==2595== Possible data race during read of size 4 at 0x5744410 by thread #3
==2595== Locks held: none
==2595==    at 0x54FC71D: mbedtls_ecp_grp_id_list (ecp.c:446)
==2595==    by 0x50A080A: mbedtls_ssl_config_defaults (ssl_tls.c:9205)
==2595==    by 0x4E5948F: SxSocket::initTLS(SxString const&, SxString const&, SxString const&) (SxSocket.cpp:200)
==2595==    by 0x4E5A52B: SxSocket::SxSocket(SxAddrInfo const&, SxString const&) (SxSocket.cpp:51)
==2595==    by 0x404CBA: create<SxAddrInfo, char [25]> (SxPtr.hpp:146)
==2595==    by 0x404CBA: EchoClientTask::main() (echoclient.cpp:101)
==2595==    by 0x59E56BB: SxThread::mainWrap() (SxThread.cpp:148)
==2595==    by 0x59B453C: SxThreadPool::PoolThread::main() (SxThreadPool.cpp:70)
==2595==    by 0x59E28E8: SxSystemThread::launcher() (SxSystemThread.cpp:118)
==2595==    by 0x59E2B7E: SxSystemThread::loopCallback(void*) (SxSystemThread.cpp:179)
==2595==    by 0x4C3081E: mythread_wrapper (hg_intercepts.c:389)
==2595==    by 0x66A1DD4: start_thread (in /usr/lib64/libpthread-2.17.so)
==2595==    by 0x6ECCEAC: clone (in /usr/lib64/libc-2.17.so)
==2595== 
==2595== This conflicts with a previous write of size 4 by thread #2
==2595== Locks held: none
==2595==    at 0x54FC787: mbedtls_ecp_grp_id_list (ecp.c:459)
==2595==    by 0x50A080A: mbedtls_ssl_config_defaults (ssl_tls.c:9205)
==2595==    by 0x4E5948F: SxSocket::initTLS(SxString const&, SxString const&, SxString const&) (SxSocket.cpp:200)
==2595==    by 0x4E5A52B: SxSocket::SxSocket(SxAddrInfo const&, SxString const&) (SxSocket.cpp:51)
==2595==    by 0x404CBA: create<SxAddrInfo, char [25]> (SxPtr.hpp:146)
==2595==    by 0x404CBA: EchoClientTask::main() (echoclient.cpp:101)
==2595==    by 0x59E56BB: SxThread::mainWrap() (SxThread.cpp:148)
==2595==    by 0x59B453C: SxThreadPool::PoolThread::main() (SxThreadPool.cpp:70)
==2595==    by 0x59E28E8: SxSystemThread::launcher() (SxSystemThread.cpp:118)
==2595==  Address 0x5744410 is 0 bytes inside data symbol "init_done.3821" 
==2595== 
==2595== ----------------------------------------------------------------

These SxSocket instances do just that: each reads in a cert file and a key file, sets up a DRBG, and then uses these - see the sketch below.
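
In condensed form (a sketch of the per-instance setup using the mbed TLS 2.x API; error checks are omitted and the function and variable names are illustrative, not our actual code):

#include "mbedtls/ctr_drbg.h"
#include "mbedtls/entropy.h"
#include "mbedtls/pk.h"
#include "mbedtls/ssl.h"
#include "mbedtls/x509_crt.h"

/* Every SxSocket instance owns its own copy of all of these contexts;
 * nothing is shared between instances or threads. */
static int init_tls_per_instance(mbedtls_ssl_context *ssl,
                                 mbedtls_ssl_config *conf,
                                 mbedtls_x509_crt *cert,
                                 mbedtls_pk_context *pkey,
                                 mbedtls_entropy_context *entropy,
                                 mbedtls_ctr_drbg_context *drbg,
                                 const char *certFile, const char *keyFile)
{
    mbedtls_ssl_init(ssl);
    mbedtls_ssl_config_init(conf);
    mbedtls_x509_crt_init(cert);
    mbedtls_pk_init(pkey);
    mbedtls_entropy_init(entropy);
    mbedtls_ctr_drbg_init(drbg);

    mbedtls_x509_crt_parse_file(cert, certFile);
    mbedtls_pk_parse_keyfile(pkey, keyFile, NULL);
    mbedtls_ctr_drbg_seed(drbg, mbedtls_entropy_func, entropy, NULL, 0);

    mbedtls_ssl_config_defaults(conf, MBEDTLS_SSL_IS_CLIENT,
                                MBEDTLS_SSL_TRANSPORT_STREAM,
                                MBEDTLS_SSL_PRESET_DEFAULT);
    mbedtls_ssl_conf_rng(conf, mbedtls_ctr_drbg_random, drbg);
    mbedtls_ssl_conf_own_cert(conf, cert, pkey);

    return mbedtls_ssl_setup(ssl, conf);
}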

My mbedtls is built with MBEDTLS_THREADING_PTHREAD, and I verified it is linked properly using ldd.

Hi @omtayroom
Theoretically you can, but the main purpose of having an SSL configuration is to share it across all TLS sessions.
I am not sure what entropy source you are using, but it is quite possible that this is the source of the error, if you end up with a single entropy source that is not thread safe.

Yes, I am aware of the sharing. This concurrent, non-shared usage is not the typical use case, but it should be possible given the API we put on top.

My entropy source is this (pieced together from the sources):

mbedtls_entropy_context  cryptEntropy = {};
mbedtls_ctr_drbg_context cryptRNG = {};

mbedtls_ctr_drbg_init (&cryptRNG);
mbedtls_entropy_init  (&cryptEntropy);
err = mbedtls_ctr_drbg_seed (&cryptRNG, mbedtls_entropy_func, &cryptEntropy, ...);
if (err != 0 ) { ... }

mbedtls_ssl_conf_rng (&tlsConf, mbedtls_ctr_drbg_random, &cryptRNG );
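
/* ... and later, at SxSocket destruction: */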

mbedtls_ctr_drbg_free   (&cryptRNG);
mbedtls_entropy_free    (&cryptEntropy);

This is done by all these SxSocket instances, which exist concurrently in the case of the thread test.

MBEDTLS_THREADING_C is also defined; I verified this with query_compile_time_config.

This morning I noticed another interesting Helgrind message:

==2927== ----------------------------------------------------------------
==2927== 
==2927== Possible data race during write of size 8 at 0x57443D0 by thread #3
==2927== Locks held: none
==2927==    at 0x54FEAF6: ecp_add_mixed (ecp.c:1391)
==2927==    by 0x5501720: mbedtls_ecp_muladd_restartable (ecp.c:2552)
==2927==    by 0x54FC056: ecdsa_verify_restartable (ecdsa.c:560)
==2927==    by 0x54FC590: mbedtls_ecdsa_read_signature_restartable (ecdsa.c:778)
==2927==    by 0x54FC46A: mbedtls_ecdsa_read_signature (ecdsa.c:729)
==2927==    by 0x5513DFB: ecdsa_verify_wrap (pk_wrap.c:482)
==2927==    by 0x5513BFB: eckey_verify_wrap (pk_wrap.c:247)
==2927==    by 0x551322E: mbedtls_pk_verify_restartable (pk.c:281)
==2927==    by 0x5084DB4: ssl_parse_server_key_exchange (ssl_cli.c:2642)
==2927==    by 0x5086A39: mbedtls_ssl_handshake_client_step (ssl_cli.c:3563)
==2927==    by 0x509EC97: mbedtls_ssl_handshake_step (ssl_tls.c:8064)
==2927==    by 0x509ED27: mbedtls_ssl_handshake (ssl_tls.c:8088)
==2927== 
==2927== This conflicts with a previous write of size 8 by thread #2
==2927== Locks held: none
==2927==    at 0x54FF41F: ecp_randomize_jac (ecp.c:1474)
==2927==    by 0x54FFC63: ecp_mul_comb_core (ecp.c:1820)
==2927==    by 0x54FFF02: ecp_mul_comb_after_precomp (ecp.c:1931)
==2927==    by 0x55001AC: ecp_mul_comb (ecp.c:2080)
==2927==    by 0x55011AB: mbedtls_ecp_mul_restartable (ecp.c:2369)
==2927==    by 0x54FADE2: ecdh_compute_shared_restartable (ecdh.c:122)
==2927==    by 0x54FAE82: mbedtls_ecdh_compute_shared (ecdh.c:151)
==2927==    by 0x54FB5B6: ecdh_calc_secret_internal (ecdh.c:629)
==2927==  Address 0x57443d0 is 0 bytes inside data symbol "mul_count" 
==2927==

Another one, involving mbedtls_ssl_handshake() and mbedtls_x509_crt_parse():

==2927== Possible data race during read of size 8 at 0x57443D0 by thread #2
==2927== Locks held: none
==2927==    at 0x54FD561: ecp_normalize_jac (ecp.c:1106)
==2927==    by 0x54FFF4B: ecp_mul_comb_after_precomp (ecp.c:1942)
==2927==    by 0x55001AC: ecp_mul_comb (ecp.c:2080)
==2927==    by 0x55011AB: mbedtls_ecp_mul_restartable (ecp.c:2369)
==2927==    by 0x54FADE2: ecdh_compute_shared_restartable (ecdh.c:122)
==2927==    by 0x54FAE82: mbedtls_ecdh_compute_shared (ecdh.c:151)
==2927==    by 0x54FB5B6: ecdh_calc_secret_internal (ecdh.c:629)
==2927==    by 0x54FB686: mbedtls_ecdh_calc_secret (ecdh.c:661)
==2927==    by 0x5085931: ssl_write_client_key_exchange (ssl_cli.c:2991)
==2927==    by 0x5086A89: mbedtls_ssl_handshake_client_step (ssl_cli.c:3586)
==2927==    by 0x509EC97: mbedtls_ssl_handshake_step (ssl_tls.c:8064)
==2927==    by 0x509ED27: mbedtls_ssl_handshake (ssl_tls.c:8088)
==2927== 
==2927== This conflicts with a previous write of size 8 by thread #3
==2927== Locks held: none
==2927==    at 0x55012FE: ecp_check_pubkey_sw (ecp.c:2424)
==2927==    by 0x5501865: mbedtls_ecp_check_pubkey (ecp.c:2632)
==2927==    by 0x551615D: pk_get_ecpubkey (pkparse.c:508)
==2927==    by 0x55165D3: mbedtls_pk_parse_subpubkey (pkparse.c:664)
==2927==    by 0x52BA26C: x509_crt_parse_der_core (x509_crt.c:990)
==2927==    by 0x52BA5E6: mbedtls_x509_crt_parse_der (x509_crt.c:1121)
==2927==    by 0x52BA74D: mbedtls_x509_crt_parse (x509_crt.c:1218)
==2927==    by 0x52BA849: mbedtls_x509_crt_parse_file (x509_crt.c:1263)
==2927==  Address 0x57443d0 is 0 bytes inside data symbol "mul_count"

Hi @omtayroom
Are you using the same mbedtls_ssl_context structure for all your threads?

Also,

My entropy source is this (pieced together from the sources):

This is not your entropy source. This is your call to seed your DRBG and to set the DRBG function in your TLS configuration.
What is your entropy source, i.e. the one that is polled to get entropy?

Hello Ron,

No, each thread creates N SxSocket objects one after another. Each SxSocket inits a new mbedtls_ssl_context at construction and frees it at destruction.

Then I can’t tell what my entropy source is. I thought mbedtls_entropy_func() is what organizes entropy. How can I find out?

How can that account for a segfault? I understand that entropy sources block when there is no entropy, so I would expect threads to wait on it when it's exhausted.

Another question: is debugging thread-safe?
I use mbedtls_ssl_conf_dbg() to set up my own debug callback function. Helgrind shows many warnings involving mbedtls_debug_set_threshold() in both threads during SxSocket object construction. Later, at runtime, Helgrind shows warnings involving mbedtls_debug_print_msg().

Hi @omtayroom

Then I can’t tell what my entropy source is. I thought mbedtls_entropy_func() is what organizes entropy. How can I find out?

Well, yes. This function pretty much organizes entropy, but it is not the entropy source itself. It gathers entropy from all the sources on your platform and conditions them into a single entropy output.
On your platform, you should check what is configured with regard to entropy sources, and which functions are called within mbedtls_entropy_init().
In addition, you should check whether you have explicit calls to mbedtls_entropy_add_source().

How can that account for a segfault? I understand that entropy sources block when there is no entropy, so I would expect threads to wait on it when it's exhausted.

This is dependent on your implementation.

is debugging thread-safe?

The default implementation of debugging writes to stdout. If you have your own implementation of the debug callback that uses global variables, I would suggest adding thread safety to it.
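
For example, something along these lines (just a sketch; the mutex and the callback name are illustrative, not part of the library):

#include <pthread.h>
#include <stdio.h>

#include "mbedtls/ssl.h"

/* Sketch of a thread-safe debug callback: the shared resource
 * (here just the output stream) is guarded by a mutex. */
static pthread_mutex_t dbg_mutex = PTHREAD_MUTEX_INITIALIZER;

static void my_debug(void *ctx, int level,
                     const char *file, int line, const char *str)
{
    (void) level;
    pthread_mutex_lock(&dbg_mutex);
    fprintf((FILE *) ctx, "%s:%04d: %s", file, line, str);
    fflush((FILE *) ctx);
    pthread_mutex_unlock(&dbg_mutex);
}

/* Registered per configuration with:
 *   mbedtls_ssl_conf_dbg(&tlsConf, my_debug, stdout);
 */

Note also that mbedtls_debug_set_threshold() writes a single global threshold, which would explain the Helgrind reports you see during construction.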

Hello Ron,

If I read it right, these two get called:

...
#if !defined(MBEDTLS_NO_DEFAULT_ENTROPY_SOURCES)
#if !defined(MBEDTLS_NO_PLATFORM_ENTROPY)
    mbedtls_entropy_add_source( ctx, mbedtls_platform_entropy_poll, NULL,
                                MBEDTLS_ENTROPY_MIN_PLATFORM,
                                MBEDTLS_ENTROPY_SOURCE_STRONG );
#endif
...
#if defined(MBEDTLS_TIMING_C)
    mbedtls_entropy_add_source( ctx, mbedtls_hardclock_poll, NULL,
                                MBEDTLS_ENTROPY_MIN_HARDCLOCK,
                                MBEDTLS_ENTROPY_SOURCE_WEAK );
#endif

I have never seen any of the error codes mbedtls_platform_entropy_poll() can return. The mbedtls_hardclock_poll() one I don't understand.

There are none - no explicit calls to mbedtls_entropy_add_source() in my own code.

Hi @omtayroom
Thank you for your information
It could be that the crash happens within your platform entropy polling, whether in your getrandom() call or the fopen("/dev/urandom") call. However, it's unlikely, as these functions should be thread safe.

The Helgrind logs indicate that you have some sort of shared resource between the threads, probably the public key, since all of them point to issues related to the public key, whether in its parsing or in the ECDH process.
Have you checked that you have enough memory in your heap for all the allocations needed by the multiple handshakes? Perhaps your allocator doesn't return NULL when out of memory?
Regards

Hello Ron,

I think so too.

The public key is kept in an internal (protected) variable of each SxSocket instance, so it's not shared, I think.

However, what happens in this test is that each instance reads the key from the same file, so each instance ends up with the very same key. Is there, perhaps, some optimization kicking in here?

I just checked my heap consumption with the Massif tool (also part of Valgrind) and it looks as expected to me. I.e., the peak snapshot happens early at runtime (snapshot 23 of 71, 3 of 77, …), not towards the end. Total consumption remains almost constant after the allocations build up in the beginning. After ~1000 iterations have passed, I see no increase right up to the segfault.

Also I have never seen bad_alloc exceptions or dmesg entries indicating out of memory.

Hi @omtayroom

However, what happens in this test is that each instance reads the key from the same file, so each instance ends up with the very same key. Is there, perhaps, some optimization kicking in here?

It doesn't matter that you are using the same key if you are parsing it into a different context. However, it might be that your platform has some reference counting in its file functionality, and this causes issues.

Meanwhile I could confirm that the segfault is triggered by the way my code stores the sockets (and the mbedtls contexts along with them).

The "same key" theory is also ruled out.

I’ll update when I know the actual reason.

Hello Ron,

Here is the mbedtls_ssl_context tlsCtx data passed to mbedtls_ssl_handshake(&tlsCtx). It's from the debugger and should be the call that triggers the segfault:

{
 conf = 0x7fb6a0095120,
 state = 2,
 renego_status = 0,
 renego_records_seen = 0,
 major_ver = 3,
 minor_ver = 1,
 badmac_seen = 0,
 f_send = 0x7fb6a9fa2870 ,
 f_recv = 0x0,
 f_recv_timeout = 0x7fb6a9fa2770 ,
 p_bio = 0x7fb6a0003400,
 session_in = 0x0,
 session_out = 0x0,
 session = 0x0,
 session_negotiate = 0x7fb6a000c830,
 handshake = 0x7fb6a000c8d0,
 transform_in = 0x0,
 transform_out = 0x0,
 transform = 0x0,
 transform_negotiate = 0x7fb6a000c6f0,
 p_timer = 0x0,
 f_set_timer = 0x0,
 f_get_timer = 0x0,
 in_buf = 0x7fb6a0004430 "",
 in_ctr = 0x7fb6a0004430 "",
 in_hdr = 0x7fb6a0004438 "\026\003\003",
 in_len = 0x7fb6a000443b "",
 in_iv = 0x7fb6a000443d "",
 in_msg = 0x7fb6a000443d "",
 in_offt = 0x0,
 in_msgtype = 0,
 in_msglen = 0,
 in_left = 0,
 in_epoch = 0,
 next_record_offset = 0,
 in_window_top = 0,
 in_window = 0,
 in_hslen = 0,
 nb_zero = 0,
 keep_current_message = 0,
 disable_datagram_packing = 0 '\000',
 out_buf = 0x7fb6a0008590 "",
 out_ctr = 0x7fb6a0008590 "",
 out_hdr = 0x7fb6a0008598 "\026\003\001\001u\001",
 out_len = 0x7fb6a000859b "\001u\001",
 out_iv = 0x7fb6a000859d "\001",
 out_msg = 0x7fb6a000859d "\001",
 out_msgtype = 22,
 out_msglen = 373,
 out_left = 0,
 cur_out_ctr = "\000\000\000\000\000\000\000\001",
 mtu = 0,
 split_done = 0 '\000',
 client_auth = 0,
 hostname = 0x0,
 alpn_chosen = 0x0,
 cli_id = 0x0,
 cli_id_len = 0,
 secure_renegotiation = 0,
 verify_data_len = 0,
 own_verify_data = '\000' ,
 peer_verify_data = '\000' 
}

Can you spot anything odd?

mbedtls_ssl_read_record() and mbedtls_ssl_fetch_input() are at the bottom of the backtrace.

The mbedtls_net_context netCtx.fd has a value of 1037 when I last see it (passed to mbedtls_ssl_set_bio() right before the call to mbedtls_ssl_handshake(&tlsCtx)), roughly as sketched below.
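
In code the wiring is roughly this (a trimmed sketch of the call site; error handling and the read-timeout configuration are omitted, the function name is illustrative):

#include "mbedtls/net_sockets.h"
#include "mbedtls/ssl.h"

/* Trimmed sketch of the call site (consistent with the f_send/f_recv_timeout
 * pointers in the dump above; error handling omitted). */
static int doHandshake(mbedtls_ssl_context *tlsCtx, mbedtls_net_context *netCtx)
{
    int ret;

    mbedtls_ssl_set_bio(tlsCtx, netCtx,
                        mbedtls_net_send, NULL, mbedtls_net_recv_timeout);

    while ((ret = mbedtls_ssl_handshake(tlsCtx)) != 0) {
        if (ret != MBEDTLS_ERR_SSL_WANT_READ &&
            ret != MBEDTLS_ERR_SSL_WANT_WRITE)
            break;   /* hard failure, give up */
    }
    return ret;
}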

Unfortunately, I can't see anything strange at the moment. Perhaps there is some data corruption that overwrites some of the pointers.
As for the fd value of 1037: is this reasonable in your system? How many file descriptors can be open at a single moment on your system?
Regards

I think I am able to open many more:

$ ulimit -n
1048576

At any single moment even the 1037 should not be in use, because I am running the test with only two threads. I think the number just goes up that much because the [Linux] kernel does not immediately re-issue the numbers.

(I see FD numbers being re-used when I run 20000 iterations with a single thread - then there is lots of time between the uses.)