Advice for handling corruption of existing KVStore keys when writing large values to other keys

The code is non-free, so I can’t post it.

I’m using the global KVStore: I run kv_init_storage_config() and all that.

I’m calling the functions that write global variables from an event queue thread with a large stack, for memory reasons: the RTOS thread that decides to write has too small a stack to do the write itself.

// dedicated thread with an 8 KB stack that just dispatches the event queue
Thread kvstoreThread(osPriorityHigh, 8192);
kvstoreThread.start(callback(&global_kvstore_queue, &EventQueue::dispatch_forever));

// KVStore reads/writes get posted onto that queue, e.g.:
global_kvstore_queue.call(kvstore_global_aws_rootca_read);

There are similar functions for the other global vars and, of course, the corresponding _write functions. The event queue itself works fine.
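
For context, one of those _write functions is essentially just a kv_set call. Here’s a trimmed-down sketch of what that looks like; the key name, buffer, and function name are placeholders, not the real (non-free) code:

#include "mbed.h"
#include "kvstore_global_api.h"

// Sketch only: key name and buffer are placeholders.
static char aws_rootca_pem[2048];   // hypothetical global holding the string to store

void kvstore_global_aws_rootca_write()
{
    // keys in the default global KVStore live under the "/kv/" prefix
    int res = kv_set("/kv/aws_rootca", aws_rootca_pem, strlen(aws_rootca_pem), 0);
    printf("kv_set returned %d\n", res);   // I always see 0 here, even in the failing cases
}

// never called directly from the small-stack thread; it is posted onto the queue:
// global_kvstore_queue.call(kvstore_global_aws_rootca_write);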

Let’s say that via Pelion and its API I store an 11-byte string, “hello world”, then write that into the global KVStore. It works: I can read that value out of the KVStore after a reboot. Upon reboot, Pelion has no problem reading its 41 data keys, reconnecting, and so on. I’ve done this “many” times with no KVStore corruption; it’s 100% reliable.

However, with long strings it’s different: sometimes I can write a couple of them, sometimes it fails after a single write.

Let’s say that via Pelion and its API I store a 1220-byte TLS certificate in a string, then write that with a key into the global KVStore. It works, no errors. After a reboot and/or power cycle, it’s important to note that my new key is readable and error-free and MY data is not lost. Sometimes I can store a couple of KB of data in the global KVStore; sometimes a write larger than a couple of hundred bytes immediately damages other keys.

I wrapped my kv_set with iterators that run before and after the kv_set. The iterator prints out the names of the keys. Before writing a 1220-byte value I had 41 keys in the global KVStore from Pelion, all with names like “pelion_whatever”, plus my 5 application keys, for a total of 46 keys. I believe the compiled-in limit is 64 keys, so no problem there. The list of 46 keys looks identical before and after the kv_set executes. The kv_set did not return an error code; I got a zero back, which is good. The kv_set call looks the same whether I write short or long strings.
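
For reference, the wrapper is roughly the sketch below; names and buffer sizes are placeholders, and I’m assuming kv_iterator_next hands back the bare key names (which matches what I see printed):

#include "mbed.h"
#include "kvstore_global_api.h"

// Sketch of the key-listing wrapper around kv_set; not the real (non-free) code.
static void dump_keys(const char *label)
{
    kv_iterator_t it;
    char key[128];                                     // buffer for one key name
    int count = 0;
    if (kv_iterator_open(&it, "/kv/") != MBED_SUCCESS) {
        printf("iterator open failed\n");
        return;
    }
    while (kv_iterator_next(it, key, sizeof(key)) == MBED_SUCCESS) {
        printf("%s key %d: %s\n", label, ++count, key);
    }
    kv_iterator_close(it);
    printf("%s: %d keys\n", label, count);             // 46 both times
}

static int checked_kv_set(const char *full_key, const void *buf, size_t len)
{
    dump_keys("before");
    int res = kv_set(full_key, buf, len, 0);           // returns 0 for short and long values alike
    dump_keys("after");
    return res;
}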

I can write a large number of small strings, reboot, and the Pelion keys are fine. If I write a small number of large strings and reboot, it seems some existing keys got clobbered, and I get errors from Pelion along the lines of “fcc_developer_flow() failed with 25” and “client_error(16) -> failed to read credentials from storage”. Multiple reboots or power cycles make no difference. The only way to recover is to re-flash, and Pelion will then onboard it as a new device.

Note that the data I saved can be read back successfully and error-free after a reboot or power cycle; it’s as if the write is damaging other, older keys in the KVStore.

I’ve played around with increasing the stack sizes of my RTOS threads until I don’t have enough heap left for Pelion to run: no change. I’ve also cut my stack sizes back until I run out of stack space while obviously having plenty of heap: no change. The rate and type of failure seem independent of memory allocation. Although the failure depends on the length of the value stored in the KVStore, shifting memory around has no effect on it.

There is an element of randomness: sometimes I can write 3, even 4 long strings before existing Pelion keys get corrupted and the next reboot fails. I never lose my data; it’s always other existing keys in the KVStore that get corrupted. With large values, though, it almost always fails immediately on the first write.

Let’s say that via Pelion and its API I store a 1220-byte TLS certificate in a global string variable, then write that into the global KVStore. Writing that long string itself works fine.

This is on a DISCO_L475VG_IOT01A, and Mbed Studio is up to date.

I’m just curious what troubleshooting strategies other people have pursued.

Perhaps there is a gap in the documentation and I’m doing something wrong, although almost everything seems to work.

I wonder if there’s an unpublished limit on the length of KVStore values. It would be a pity if they can’t be longer than about 1 KB without corrupting other data.

Could there be a conflict where writing to the KVStore from one RTOS thread while other threads are running scrambles the flash data in some manner? I haven’t read any docs implying the KVStore can’t be used from RTOS threads, but …

Could my KVStore be full? It seems unlikely that the result would be silent data corruption rather than an error code.
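
One rough way to check that would be to add up the reported value sizes with kv_get_info, something like the sketch below. It ignores TDBStore record overhead and stale copies awaiting garbage collection, so it only gives a lower bound, and I’m assuming the iterator returns bare key names that need the "/kv/" prefix re-attached:

#include "mbed.h"
#include "kvstore_global_api.h"

// Rough estimate of how much value data is stored in the global KVStore.
static void print_kvstore_usage()
{
    kv_iterator_t it;
    char key[128];
    size_t total = 0;
    if (kv_iterator_open(&it, "/kv/") != MBED_SUCCESS) {
        return;
    }
    while (kv_iterator_next(it, key, sizeof(key)) == MBED_SUCCESS) {
        char full[160];
        snprintf(full, sizeof(full), "/kv/%s", key);   // kv_get_info wants the full name
        kv_info_t info;
        if (kv_get_info(full, &info) == MBED_SUCCESS) {
            total += info.size;
        }
    }
    kv_iterator_close(it);
    printf("approx. %u bytes of values stored\n", (unsigned)total);
}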

It might be a peculiarity of my dev board. I don’t have a good answer for where to store a couple of KB of TLS data; I could add an external I2C EEPROM just for my data and leave the global KVStore to Pelion.


I have seen something like this too in one of our applications.

We are using the Pelion client in UDP_QUEUE mode due to the nature of this application. In general, we wake up, take a data reading, and store the timestamped value. If the reading is outside the threshold bounds, or a predetermined count of readings is reached, we trigger a new connection and transmission.

This application uses a cellular connection, so to save power we cut power to the cell module completely after each transmission via the Pelion client. To achieve this, we use the Pelion client’s pause functionality.

The data is currently stored in the global KVStore (along with Pelion’s data). We have seen devices run for multiple days without a hitch and then suddenly hang while the Pelion client is attempting to read its data. A power cycle usually clears the hang, but most of the time the device must then re-bootstrap and obtain a new Pelion identity, which is obviously not ideal.

I am currently going down the same path that you (@vincemulhollon) suggested: using SlicingBlockDevice, I have cut our flash storage into multiple “partitions”. I then created my own KVStore that uses a separate partition from the one I provide to the Pelion client and the global KVStore. So far the results look promising, so it might be something to look into. Just a thought!
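
In case it helps, the rough shape of that setup is sketched below. It is not our production code: the slice offsets are placeholders that must land on erase-block boundaries, and the region the global KVStore / Pelion uses is still configured through the storage section of mbed_app.json.

#include "mbed.h"
#include "BlockDevice.h"
#include "SlicingBlockDevice.h"
#include "TDBStore.h"

// Placeholder offsets: pick addresses that do not overlap the region the
// global KVStore / Pelion is configured to use on the same device.
#define APP_SLICE_START  (1024 * 1024)
#define APP_SLICE_END    (APP_SLICE_START + 256 * 1024)

BlockDevice *base_bd = BlockDevice::get_default_instance();

// Second "partition" carved out of the same flash, owned entirely by the application.
SlicingBlockDevice app_bd(base_bd, APP_SLICE_START, APP_SLICE_END);
TDBStore app_store(&app_bd);

int app_store_init()
{
    return app_store.init();                  // call once at startup
}

int app_store_write(const char *key, const void *value, size_t len)
{
    // application keys never touch the partition Pelion uses
    return app_store.set(key, value, len, 0);
}

That keeps Pelion’s credential keys physically separate, so even if a large application write misbehaves, it can only damage our own partition.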