mbedtls_ssl_handshake() segfault after ~1000 iterations

Hello,

I have a class EchoClient which essentially wraps mbedtls into an encryption-enabled echo client object.

Then in main() I create such objects in a loop and have each of them send and receive a string. This is to stress it (and to test thread behavior later on).

So far it does what it should. But after about 1000 iterations I get this segfault:

#7  0x00000000004015f5 in main () at t1792_handshake_segfault_client.cpp:183
#6  0x000000000040185f in EchoClient::echo (this=0x7ffe9eb61c00) at t1792_handshake_segfault_client.cpp:46
#5  0x00007f09179e2870 in mbedtls_ssl_handshake () from /usr/local/lib/libmbedtls.so.12
#4  0x00007f09179e280d in mbedtls_ssl_handshake_step () from /usr/local/lib/libmbedtls.so.12
#3  0x00007f09179d7388 in mbedtls_ssl_handshake_client_step () from /usr/local/lib/libmbedtls.so.12
#2  0x00007f09179d5324 in ssl_parse_server_hello () from /usr/local/lib/libmbedtls.so.12
#1  0x00007f09179e718e in mbedtls_ssl_read_record () from /usr/local/lib/libmbedtls.so.12
#0  0x00007f09179e5dee in mbedtls_ssl_fetch_input () from /usr/local/lib/libmbedtls.so.12

With mbedtls_debug_set_threshold(2), the last echo (iteration 1021 in this case) says this:

--> task 1021
ssl_tls.c:8084: 0x7fff9b1abb98: => handshake
ssl_cli.c:3510: 0x7fff9b1abb98: client state: 0
ssl_tls.c:2755: 0x7fff9b1abb98: => flush output
ssl_tls.c:2767: 0x7fff9b1abb98: <= flush output
ssl_cli.c:3510: 0x7fff9b1abb98: client state: 1
ssl_tls.c:2755: 0x7fff9b1abb98: => flush output
ssl_tls.c:2767: 0x7fff9b1abb98: <= flush output
ssl_cli.c:0774: 0x7fff9b1abb98: => write client hello
ssl_tls.c:3184: 0x7fff9b1abb98: => write handshake message
ssl_tls.c:3343: 0x7fff9b1abb98: => write record
ssl_tls.c:2755: 0x7fff9b1abb98: => flush output
ssl_tls.c:2774: 0x7fff9b1abb98: message length: 378, out_left: 378
ssl_tls.c:2779: 0x7fff9b1abb98: ssl->f_send() returned 378 (-0xfffffe86)
ssl_tls.c:2807: 0x7fff9b1abb98: <= flush output
ssl_tls.c:3476: 0x7fff9b1abb98: <= write record
ssl_tls.c:3320: 0x7fff9b1abb98: <= write handshake message
ssl_cli.c:1106: 0x7fff9b1abb98: <= write client hello
ssl_cli.c:3510: 0x7fff9b1abb98: client state: 2
ssl_tls.c:2755: 0x7fff9b1abb98: => flush output
ssl_tls.c:2767: 0x7fff9b1abb98: <= flush output
ssl_cli.c:1499: 0x7fff9b1abb98: => parse server hello
ssl_tls.c:4311: 0x7fff9b1abb98: => read record
ssl_tls.c:2536: 0x7fff9b1abb98: => fetch input
ssl_tls.c:2697: 0x7fff9b1abb98: in_left: 0, nb_want: 5
Segmentation fault (core dumped)

Now I am lost on how to figure out what the actual cause is.

Must I check something on the socket before calling mbedtls_ssl_read()?

Here is the EchoClient:

// g++ -g -Wall -o t1792_handshake_segfault_client t1792_handshake_segfault_client.cpp -lmbedtls -lmbedx509 -lmbedcrypto

#include "mbedtls/config.h"
#include "mbedtls/platform.h"
#include "mbedtls/net_sockets.h"
#include "mbedtls/error.h"
#include "mbedtls/debug.h"
#include "mbedtls/entropy.h"
#include "mbedtls/ctr_drbg.h"

#include <stdio.h>   // sprintf(), FILE
#include <string.h>

#define PAYLOAD "echoclientstring\n"

class EchoClient
{
 public :

   EchoClient () { init(); }

   ~EchoClient () {
      mbedtls_ssl_close_notify(&tlsCtx);
      mbedtls_ssl_free        (&tlsCtx);
      mbedtls_x509_crt_free   (&tlsCert);
      mbedtls_pk_free         (&tlsKey);
      mbedtls_ctr_drbg_free   (&cryptRNG);
      mbedtls_entropy_free    (&cryptEntropy);
      mbedtls_ssl_config_free (&tlsConf);
   }

   void echo() {
      // --- connect
      err = mbedtls_net_connect (&netCtx, "localhost", "12345", MBEDTLS_NET_PROTO_TCP);
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Connecting failed: %d - %s\n\n", err, error_buf );
         return;
      }

      // --- handshake
      mbedtls_ssl_set_bio (&tlsCtx, (void*)&netCtx,
                           mbedtls_net_send, mbedtls_net_recv,
                           mbedtls_net_recv_timeout);
      err = mbedtls_ssl_handshake (&tlsCtx);
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Handshake failed: %d - %s\n\n", err, error_buf );
         return;
      }

      // --- send
      mbedtls_printf( "sending  ");
      int len = sprintf( (char *) buf, PAYLOAD);
      while( (err = mbedtls_ssl_write( &tlsCtx, buf, len )) <= 0 )
      {
          if( err != MBEDTLS_ERR_SSL_WANT_READ && err != MBEDTLS_ERR_SSL_WANT_WRITE)
          {
              mbedtls_printf( "  ! mbedtls_ssl_write returned %d\n\n", err);
              return;
          }
      }
      len = err;
      mbedtls_printf( " %d bytes:\n%s\n--\n", len, (char *)buf );

      // --- receive
      mbedtls_printf( "receiving");
      len = sizeof( buf ) - 1;
      memset( buf, 0, sizeof( buf ) );
      err = mbedtls_ssl_read( &tlsCtx, buf, len );
      if( err < 0 )  {
          mbedtls_printf( "failed\n  ! mbedtls_ssl_read returned %d\n\n", err );
         return;
      }

      if( err == 0 ) {
          mbedtls_printf( "\n\nEOF\n\n" );
      }

      len = err;
      mbedtls_printf( " %d bytes:\n%s\n--\n", len, (char *)buf );

      if (strcmp ((char *)buf, PAYLOAD) != 0) {
         mbedtls_printf( "\nresponse does not match payload: >%s< != >%s<\n\n", (char *)buf, PAYLOAD);
         return;
      } else {
         mbedtls_printf( "echo SUCCESS\n\n");
      }
   } // echo


 protected :
   int err = 0;
   unsigned char buf[1024];
   char error_buf[100];

   mbedtls_net_context        netCtx  = {};
   mbedtls_ssl_config         tlsConf = {};
   mbedtls_ssl_context        tlsCtx  = {};
   mbedtls_x509_crt           tlsCert = {};
   mbedtls_pk_context         tlsKey  = {};
   mbedtls_entropy_context    cryptEntropy = {};
   mbedtls_ctr_drbg_context   cryptRNG = {};

   void init() {
      mbedtls_net_init (&netCtx);
      mbedtls_ssl_init (&tlsCtx);
      mbedtls_x509_crt_init (&tlsCert);

      // --- load cert(s)
      err = mbedtls_x509_crt_parse_file (&tlsCert, "echoclient.certchain.pem");
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Load cert failed: %d - %s\n\n", err, error_buf );
         return;
      }

      mbedtls_pk_init (&tlsKey);

      // --- seed number generator
      mbedtls_ctr_drbg_init (&cryptRNG);
      mbedtls_entropy_init  (&cryptEntropy);
      
      const char *pers = "t1792 client";
      err = mbedtls_ctr_drbg_seed (&cryptRNG, mbedtls_entropy_func, &cryptEntropy,
                                   (const unsigned char *) pers, strlen(pers) );
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Seed RNG failed: %d - %s\n\n", err, error_buf );
         return;
      }

      // --- setup TLS facility
      mbedtls_ssl_config_init (&tlsConf);
      err = mbedtls_ssl_config_defaults (&tlsConf,
                                         MBEDTLS_SSL_IS_CLIENT,
                                         MBEDTLS_SSL_TRANSPORT_STREAM,
                                         MBEDTLS_SSL_PRESET_DEFAULT);
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Setting tlsConf failed: %d - %s\n\n", err, error_buf );
         return;
      }
      mbedtls_ssl_conf_rng (&tlsConf, mbedtls_ctr_drbg_random, &cryptRNG );

      mbedtls_debug_set_threshold (2);
      mbedtls_ssl_conf_dbg (&tlsConf, my_debug, stdout);

      mbedtls_ssl_conf_ca_chain (&tlsConf, tlsCert.next, NULL);
      err = mbedtls_ssl_conf_own_cert (&tlsConf, &tlsCert, &tlsKey);
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("Setting CA Chain failed: %d - %s\n\n", err, error_buf );
         return;
      }
      err = mbedtls_ssl_setup (&tlsCtx, &tlsConf);
      if (err != 0 ) {
         mbedtls_strerror(err, error_buf, sizeof(error_buf));
         mbedtls_printf("SSL setup failed: %d - %s\n\n", err, error_buf );
         return;
      }
   } // init


   static void my_debug( void *ctx, int level,
                         const char *file, int line,
                         const char *str )
   {
       ((void) level);
       mbedtls_fprintf( (FILE *) ctx, "%s:%04d: %s", file, line, str );
       fflush(  (FILE *) ctx  );
   }

}; // class EchoClient


int main() {

   for(int t=0; t<10000; ++t) {
      mbedtls_printf("--> task %i\n", t);
      EchoClient e;
      e.echo();
   }

   return MBEDTLS_EXIT_SUCCESS;
}

The echoing server is this:

ncat -l 12345 -k -c 'xargs -n1 echo' --ssl --ssl-cert echo.cert.pem --ssl-key echoserver.key.pem  -v

This was done with mbedtls-2.16.2 from git.

Hi @omtayroom
Thank you for your question!
Does the segmentation fault always happen in the mbedtls_ssl_fetch_input() function?
Are you using the default configuration?
It sounds to me like there is a memory leak in your application.
Have you used some memory analyzing tool?
Where are you calling mbedtls_net_free()?
Have you changed the default value of the incoming and outgoing content buffer size?
Regards,
Mbed TLS Team member
Ron

Hello Ron,

thanks for your questions.

Yes, as far as I have observed (the segfault is always in mbedtls_ssl_fetch_input()).

I actually have another way of triggering a crash in mbedtls_ssl_handshake(), this time on the server side, which I think also comes down to mbedtls_ssl_fetch_input().

It's triggered by loading the client with a self-signed certificate and calling mbedtls_ssl_conf_authmode(&tlsConf, MBEDTLS_SSL_VERIFY_OPTIONAL) to make it go. Then, when connecting, the client errors out with

TLS DEBUG ssl_tls.c:5811: 0x7f355c002c88: got no CA chain
SSL - No CA Chain is set, but required to operate

while the server segfaults very similarly to what I show here.

(I have not managed to prepare a server example so far.)

Yes, with one modification: MBEDTLS_THREADING_C and MBEDTLS_THREADING_PTHREAD are defined, as I plan to use threading later on.

When running only 1000 iterations, to stay below the crash, valgrind says:

--27609-- Discarding syms at 0x6548130-0x654f481 in /usr/lib64/libnss_files-2.17.so due to munmap()
==27609== 
==27609== HEAP SUMMARY:
==27609==     in use at exit: 0 bytes in 0 blocks
==27609==   total heap usage: 21,400,106 allocs, 21,400,106 frees, 2,533,134,899 bytes allocated
==27609== 
==27609== All heap blocks were freed -- no leaks are possible
==27609== 
==27609== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==27609== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Oops, I forgot mbedtls_net_free() in the example.

When I add it to the destructor, after mbedtls_ssl_close_notify(), it COMPLETES the 10000 iterations.
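
For reference, the adjusted destructor then looks like this (unchanged except for the added mbedtls_net_free() call):

   ~EchoClient () {
      mbedtls_ssl_close_notify(&tlsCtx);
      mbedtls_net_free        (&netCtx);   // new: closes the socket fd that was being leaked
      mbedtls_ssl_free        (&tlsCtx);
      mbedtls_x509_crt_free   (&tlsCert);
      mbedtls_pk_free         (&tlsKey);
      mbedtls_ctr_drbg_free   (&cryptRNG);
      mbedtls_entropy_free    (&cryptEntropy);
      mbedtls_ssl_config_free (&tlsConf);
   }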

Nice, thanks. It indicates that my close() function in the production code has some problem. Let me see.

Also, I see that the ~1020 iterations I observed most of the time probably reflect the default limit on open files for my user/system: it's 1024. Strangely, the production code runs 1050…1100 iterations until the segfault? …

But why does it crash? I would expect some "can't open FD" message when that limit is the problem.
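
For the record, the limit can be checked from inside the process with getrlimit(2) (sketch, Linux-specific):

#include <sys/resource.h>
#include <stdio.h>

int main () {
   struct rlimit rl;
   // RLIMIT_NOFILE = per-process limit on open file descriptors
   if (getrlimit (RLIMIT_NOFILE, &rl) == 0)
      printf ("nofile: soft=%llu hard=%llu\n",
              (unsigned long long) rl.rlim_cur,
              (unsigned long long) rl.rlim_max);
   return 0;
}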

No, I have not changed the buffer sizes.

Hi @omtayroom

Nice, thanks. It indicates that my close() function in the production code has some problem. Let me see.

Is the issue resolved? I believe the root cause was a memory leak, as mentioned earlier.

But why does it crash? I would expect some "can't open FD" message when that limit is the problem.

The networking module is a reference module for BSD-like sockets. For the same reason you wouldn't assign new memory to a pointer without freeing it first, you should call mbedtls_net_free() before calling mbedtls_net_connect(). Perhaps the documentation should be clearer.
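
In terms of your listing, that is roughly (sketch; mbedtls_net_free() is a no-op on a freshly initialised context):

   mbedtls_net_free (&netCtx);   // release any fd left over from a previous connection
   err = mbedtls_net_connect (&netCtx, "localhost", "12345", MBEDTLS_NET_PROTO_TCP);
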
Regards

It looks like it is solved.

I added mbedtls_net_free() whenever I create an mbedtls_net_context, right after declaring it. Correct?

This makes the 10000 iterations pass with the above example as well as with my production code.

I am not yet sure about my server. I will try to produce an example if I can repeat that one.

Nope, the server can still be crashed by loading a cert without a chain into the client. The backtrace looks very similar:

#3  0x00007fb325fe5870 in mbedtls_ssl_handshake () from /usr/local/lib/libmbedtls.so.12
#2  0x00007fb325fdffc8 in mbedtls_ssl_handshake_server_step () from /usr/local/lib/libmbedtls.so.12
#1  0x00007fb325fdcf85 in ssl_parse_client_hello () from /usr/local/lib/libmbedtls.so.12
#0  0x00007fb325fe8dee in mbedtls_ssl_fetch_input () from /usr/local/lib/libmbedtls.so.12

(I have no example code yet.)

In my thread test (that is, when I have 2 threads creating such EchoClient objects) it still crashes after 1000-1200 objects were created. The number varies: three runs gave me 1063, 1177, and 1099 objects created and destroyed before the segfault.

It's always in mbedtls_ssl_fetch_input().

(I need to adjust my example to show that here.)

Hi,
It sounds to me like a concurrency issue; do you protect your data with mutexes?

Yes, it's mutexed.

I am working on an example without our own library (it handles the threads and mutexes), to rule that out.

So far I have failed to reproduce my production crash with a simplified example.

I will go another route: looking into mbedtls itself using the debugger to see precisely where mbedtls_ssl_fetch_input() crashes and on which value(s), the FD in particular.

(I could not do this yet, because I am linked against a general-purpose build installed in the OS; it first needs to be added into my app's build system, which is automake.)

Now I have an mbedTLS build made with this:

cmake3 -D CMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=${MBEDTLS_PATH}/dbg -DUSE_SHARED_MBEDTLS_LIBRARY=On -DENABLE_PROGRAMS=Off ../mbedtls_src
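
To run the client against this build instead of the system-wide one, something like this works (hypothetical invocation; ${MBEDTLS_PATH}/dbg is the install prefix from the cmake line):

LD_LIBRARY_PATH=${MBEDTLS_PATH}/dbg/lib gdb --args ./t1792_handshake_segfault_client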

Interestingly, it does not segfault with that build.

After 1019 iterations, however, my program hangs. The 2 threads stop finishing their tasks, so main() waits on the list of tasks.

When that happens I see in /proc/<pid>/fd/ that my file descriptors are exhausted.

These file descriptors are of type anon_inode:[eventpoll]. They are not the socket:[XYZ] ones which mbedtls_net_free() closes, right?

So that closing part works, and I need to look into why epoll is not cleaning up.

Hello Ron,

Can I have two SSL configurations in one process (plus two x509 certs, two private keys, and two DRBGs with entropy for each)?

The reason I ask comes from looking at race conditions using the Helgrind tool. It reports things like this:

==2595== ----------------------------------------------------------------
==2595== 
==2595== Possible data race during read of size 4 at 0x5744410 by thread #3
==2595== Locks held: none
==2595==    at 0x54FC71D: mbedtls_ecp_grp_id_list (ecp.c:446)
==2595==    by 0x50A080A: mbedtls_ssl_config_defaults (ssl_tls.c:9205)
==2595==    by 0x4E5948F: SxSocket::initTLS(SxString const&, SxString const&, SxString const&) (SxSocket.cpp:200)
==2595==    by 0x4E5A52B: SxSocket::SxSocket(SxAddrInfo const&, SxString const&) (SxSocket.cpp:51)
==2595==    by 0x404CBA: create<SxAddrInfo, char [25]> (SxPtr.hpp:146)
==2595==    by 0x404CBA: EchoClientTask::main() (echoclient.cpp:101)
==2595==    by 0x59E56BB: SxThread::mainWrap() (SxThread.cpp:148)
==2595==    by 0x59B453C: SxThreadPool::PoolThread::main() (SxThreadPool.cpp:70)
==2595==    by 0x59E28E8: SxSystemThread::launcher() (SxSystemThread.cpp:118)
==2595==    by 0x59E2B7E: SxSystemThread::loopCallback(void*) (SxSystemThread.cpp:179)
==2595==    by 0x4C3081E: mythread_wrapper (hg_intercepts.c:389)
==2595==    by 0x66A1DD4: start_thread (in /usr/lib64/libpthread-2.17.so)
==2595==    by 0x6ECCEAC: clone (in /usr/lib64/libc-2.17.so)
==2595== 
==2595== This conflicts with a previous write of size 4 by thread #2
==2595== Locks held: none
==2595==    at 0x54FC787: mbedtls_ecp_grp_id_list (ecp.c:459)
==2595==    by 0x50A080A: mbedtls_ssl_config_defaults (ssl_tls.c:9205)
==2595==    by 0x4E5948F: SxSocket::initTLS(SxString const&, SxString const&, SxString const&) (SxSocket.cpp:200)
==2595==    by 0x4E5A52B: SxSocket::SxSocket(SxAddrInfo const&, SxString const&) (SxSocket.cpp:51)
==2595==    by 0x404CBA: create<SxAddrInfo, char [25]> (SxPtr.hpp:146)
==2595==    by 0x404CBA: EchoClientTask::main() (echoclient.cpp:101)
==2595==    by 0x59E56BB: SxThread::mainWrap() (SxThread.cpp:148)
==2595==    by 0x59B453C: SxThreadPool::PoolThread::main() (SxThreadPool.cpp:70)
==2595==    by 0x59E28E8: SxSystemThread::launcher() (SxSystemThread.cpp:118)
==2595==  Address 0x5744410 is 0 bytes inside data symbol "init_done.3821" 
==2595== 
==2595== ----------------------------------------------------------------

These SxSocket instances do just that: each reads in a cert file and a key file, sets up a DRBG, and then uses these.

My mbedtls is built with MBEDTLS_THREADING_PTHREAD, and I verified it's linked properly using ldd.

Hi @omtayroom
Theoretically you can, but the main purpose of having an SSL configuration is to share it across all TLS sessions.
I am not sure what entropy source you are using, but it is quite possible that this is the source of the error, if you end up with a single entropy source which is not thread-safe.

Yes, I am aware of the sharing. This concurrent, non-shared usage is not the typical use case, but it should be possible given the API we put on top.

My entropy source is this (copied together from the sources):

mbedtls_entropy_context  cryptEntropy = {};
mbedtls_ctr_drbg_context cryptRNG = {};

mbedtls_ctr_drbg_init (&cryptRNG);
mbedtls_entropy_init  (&cryptEntropy);
err = mbedtls_ctr_drbg_seed (&cryptRNG, mbedtls_entropy_func, &cryptEntropy, ...);
if (err != 0 ) { ... }

mbedtls_ssl_conf_rng (&tlsConf, mbedtls_ctr_drbg_random, &cryptRNG );

mbedtls_ctr_drbg_free   (&cryptRNG);
mbedtls_entropy_free    (&cryptEntropy);

This is done by all these SxSocket instances, which exist concurrently in the case of the thread test.

MBEDTLS_THREADING_C is also defined (verified with query_compile_time_config).

This morning I noticed another interesting Helgrind message:

==2927== ----------------------------------------------------------------
==2927== 
==2927== Possible data race during write of size 8 at 0x57443D0 by thread #3
==2927== Locks held: none
==2927==    at 0x54FEAF6: ecp_add_mixed (ecp.c:1391)
==2927==    by 0x5501720: mbedtls_ecp_muladd_restartable (ecp.c:2552)
==2927==    by 0x54FC056: ecdsa_verify_restartable (ecdsa.c:560)
==2927==    by 0x54FC590: mbedtls_ecdsa_read_signature_restartable (ecdsa.c:778)
==2927==    by 0x54FC46A: mbedtls_ecdsa_read_signature (ecdsa.c:729)
==2927==    by 0x5513DFB: ecdsa_verify_wrap (pk_wrap.c:482)
==2927==    by 0x5513BFB: eckey_verify_wrap (pk_wrap.c:247)
==2927==    by 0x551322E: mbedtls_pk_verify_restartable (pk.c:281)
==2927==    by 0x5084DB4: ssl_parse_server_key_exchange (ssl_cli.c:2642)
==2927==    by 0x5086A39: mbedtls_ssl_handshake_client_step (ssl_cli.c:3563)
==2927==    by 0x509EC97: mbedtls_ssl_handshake_step (ssl_tls.c:8064)
==2927==    by 0x509ED27: mbedtls_ssl_handshake (ssl_tls.c:8088)
==2927== 
==2927== This conflicts with a previous write of size 8 by thread #2
==2927== Locks held: none
==2927==    at 0x54FF41F: ecp_randomize_jac (ecp.c:1474)
==2927==    by 0x54FFC63: ecp_mul_comb_core (ecp.c:1820)
==2927==    by 0x54FFF02: ecp_mul_comb_after_precomp (ecp.c:1931)
==2927==    by 0x55001AC: ecp_mul_comb (ecp.c:2080)
==2927==    by 0x55011AB: mbedtls_ecp_mul_restartable (ecp.c:2369)
==2927==    by 0x54FADE2: ecdh_compute_shared_restartable (ecdh.c:122)
==2927==    by 0x54FAE82: mbedtls_ecdh_compute_shared (ecdh.c:151)
==2927==    by 0x54FB5B6: ecdh_calc_secret_internal (ecdh.c:629)
==2927==  Address 0x57443d0 is 0 bytes inside data symbol "mul_count" 
==2927==

Another one, involving mbedtls_ssl_handshake() and mbedtls_x509_crt_parse():

==2927== Possible data race during read of size 8 at 0x57443D0 by thread #2
==2927== Locks held: none
==2927==    at 0x54FD561: ecp_normalize_jac (ecp.c:1106)
==2927==    by 0x54FFF4B: ecp_mul_comb_after_precomp (ecp.c:1942)
==2927==    by 0x55001AC: ecp_mul_comb (ecp.c:2080)
==2927==    by 0x55011AB: mbedtls_ecp_mul_restartable (ecp.c:2369)
==2927==    by 0x54FADE2: ecdh_compute_shared_restartable (ecdh.c:122)
==2927==    by 0x54FAE82: mbedtls_ecdh_compute_shared (ecdh.c:151)
==2927==    by 0x54FB5B6: ecdh_calc_secret_internal (ecdh.c:629)
==2927==    by 0x54FB686: mbedtls_ecdh_calc_secret (ecdh.c:661)
==2927==    by 0x5085931: ssl_write_client_key_exchange (ssl_cli.c:2991)
==2927==    by 0x5086A89: mbedtls_ssl_handshake_client_step (ssl_cli.c:3586)
==2927==    by 0x509EC97: mbedtls_ssl_handshake_step (ssl_tls.c:8064)
==2927==    by 0x509ED27: mbedtls_ssl_handshake (ssl_tls.c:8088)
==2927== 
==2927== This conflicts with a previous write of size 8 by thread #3
==2927== Locks held: none
==2927==    at 0x55012FE: ecp_check_pubkey_sw (ecp.c:2424)
==2927==    by 0x5501865: mbedtls_ecp_check_pubkey (ecp.c:2632)
==2927==    by 0x551615D: pk_get_ecpubkey (pkparse.c:508)
==2927==    by 0x55165D3: mbedtls_pk_parse_subpubkey (pkparse.c:664)
==2927==    by 0x52BA26C: x509_crt_parse_der_core (x509_crt.c:990)
==2927==    by 0x52BA5E6: mbedtls_x509_crt_parse_der (x509_crt.c:1121)
==2927==    by 0x52BA74D: mbedtls_x509_crt_parse (x509_crt.c:1218)
==2927==    by 0x52BA849: mbedtls_x509_crt_parse_file (x509_crt.c:1263)
==2927==  Address 0x57443d0 is 0 bytes inside data symbol "mul_count"

Hi @omtayroom
Are you using the same mbedtls_ssl_context structure for all your threads?

Also,

My entropy source is this (copied together from the sources):

This is not your entropy source. This is your call to seed your DRBG, and to set the DRBG function in your TLS configuration.
What is your entropy source, which is polled to get entropy?

Hello Ron,

No, each thread creates N SxSocket objects one after another. Each SxSocket inits a new mbedtls_ssl_context at construction and frees it at destruction.

Then I can’t tell what my entropy source is. I thought mbedtls_entropy_func() is what organizes entropy. How can I find out?

How can that account for a segfault? I understand that entropy sources block when there is no entropy, so I would expect threads to wait when it's exhausted.

Another question: is debugging thread-safe?
I use mbedtls_ssl_conf_dbg() to set up my own debug callback function. Helgrind shows many warnings involving mbedtls_debug_set_threshold() in both threads during SxSocket object construction. Later, at runtime, Helgrind shows warnings involving mbedtls_debug_print_msg().

Hi @omtayroom

Then I can’t tell what my entropy source is. I thought mbedtls_entropy_func() is what organizes entropy. How can I find out?

Well, yes. This function pretty much organizes entropy, but it is not the entropy source. It gathers entropy from all the sources on your platform and conditions them into a single output.
On your platform, you should check what is configured with regard to entropy sources, and what functions are called within mbedtls_entropy_init().
In addition, you should check whether you have explicit calls to mbedtls_entropy_add_source().

How can that account for a segfault? I understand that entropy sources block when there is no entropy, so I would expect threads to wait when it's exhausted.

This is dependent on your implementation.

is debugging thread-safe?

The default implementation of debugging writes to stdout. If you have your own implementation of the debug callback which uses global variables, I would suggest you add thread safety to it.
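
For example, a minimal thread-safe variant of the my_debug() callback from your listing (sketch; serializes output with one process-wide mutex):

#include <mutex>

static std::mutex dbg_mutex;   // assumption: one lock shared by all contexts

static void my_debug( void *ctx, int level,
                      const char *file, int line,
                      const char *str )
{
    ((void) level);
    std::lock_guard<std::mutex> lock(dbg_mutex);   // keep lines from both threads apart
    mbedtls_fprintf( (FILE *) ctx, "%s:%04d: %s", file, line, str );
    fflush( (FILE *) ctx );
}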

Hello Ron,

If I'm reading it right, these two get called:

...
#if !defined(MBEDTLS_NO_DEFAULT_ENTROPY_SOURCES)
#if !defined(MBEDTLS_NO_PLATFORM_ENTROPY)
    mbedtls_entropy_add_source( ctx, mbedtls_platform_entropy_poll, NULL,
                                MBEDTLS_ENTROPY_MIN_PLATFORM,
                                MBEDTLS_ENTROPY_SOURCE_STRONG );
#endif
...
#if defined(MBEDTLS_TIMING_C)
    mbedtls_entropy_add_source( ctx, mbedtls_hardclock_poll, NULL,
                                MBEDTLS_ENTROPY_MIN_HARDCLOCK,
                                MBEDTLS_ENTROPY_SOURCE_WEAK );
#endif

I have never seen any of the error codes mbedtls_platform_entropy_poll() can return. The mbedtls_hardclock_poll() one I don't understand.

There are none (no explicit calls to mbedtls_entropy_add_source()).

Hi @omtayroom
Thank you for your information
It could be that the crash happens within your platform entropy polling, whether in your getrandom call or in the fopen("/dev/urandom") call. However, it's unlikely, as these functions should be thread-safe.

The Helgrind logs indicate that you have some sort of shared resource, probably the public key, since all the logs point to issues related to the public key, whether in its parsing or in the ECDH process.
Have you checked that you have enough memory in your heap for all the allocations needed by the multiple handshakes? Perhaps your allocator doesn't return NULL when out of memory?
Regards

Hello Ron,

I think so too.

The public key is kept in an internal (protected) variable of each SxSocket instance, so it's not shared, I think.

However, what happens in this test is: each instance reads the key from the same file, so each instance ends up with the very same key. Is there, perhaps, some optimization crossing in here?

I just checked my heap consumption with the massif tool (also part of valgrind) and it looks as expected to me, i.e. the peak snapshot happens early at runtime (snapshot 23 of 71, 3 of 77, …), not towards the end. Total consumption remains almost constant after allocations build up in the beginning. After ~1000 iterations I see no increase, right up to the segfault.

Also I have never seen bad_alloc exceptions or dmesg entries indicating out of memory.

Hi @omtayroom

However, what happens in this test is: each instance reads the key from the same file, so each instance ends up with the very same key. Is there, perhaps, some optimization crossing in here?

It doesn't matter that you are using the same key, as long as you are parsing into a different context. However, it might be that your platform has some reference counter in its file functionality, and this causes issues.
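
To illustrate: parsing the same key file into two contexts yields two fully independent copies, with no sharing between them (sketch; file name hypothetical, error checks omitted):

mbedtls_pk_context k1, k2;
mbedtls_pk_init (&k1);
mbedtls_pk_init (&k2);
// same file parsed twice; each context owns its own copy of the key material
mbedtls_pk_parse_keyfile (&k1, "echoclient.key.pem", NULL);
mbedtls_pk_parse_keyfile (&k2, "echoclient.key.pem", NULL);
mbedtls_pk_free (&k2);
mbedtls_pk_free (&k1);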