[LoRaWAN] Join Sequence Stalls Occasionally

Hi,

We’ve got a difficult-to-replicate problem with our LoRaWAN devices. They are generally reliable, and some have been operating for months.

We recently had 200 devices configured and running the LoRaWAN join sequence, but the back-end wasn’t configured, so we expected the join requests to be ignored. When the back-end was finally configured a week later, 26 of the devices failed to join.

These devices weren’t connected to the serial debug until after we realised that they weren’t going to join, so precise information is difficult to come by.

I do know the application code was still running. It appears to me that something has gone wrong in the stack.

The radio driver, SX1276_LoRaRadio.cpp, contains the following code and comment:

void SX1276_LoRaRadio::handle_timeout_irq()
{
    tx_timeout_timer.detach();

    if (_rf_settings.state == RF_TX_RUNNING) {
        // Tx timeout shouldn't happen.
        // But it has been observed that when it happens it is a result of a
        // corrupted SPI transfer
        // The workaround is to put the radio in a known state.
        // Thus, we re-initialize it.
        ...
    }
}

This is a rephrasing of a comment in the LoRa-net version.

In other words, the TX code caters for a problem configuring the transceiver that results in the transmission never actually happening.

But, AFAICT, the RX code makes no such allowance and relies on either handle_dio0_irq or handle_dio1_irq being called.

A software RX timeout timer was removed by the commit "Removing software RX timeouts in RX chain". See also "Refactoring LoRaRadio::receive(uint32_t) API".

It’s difficult to prove, but I believe that missing both handle_dio0_irq and handle_dio1_irq after attempting to receive a join-accept message results in the join sequence stalling.

I don’t believe we see problems once joined and sending normal packets. I suspect normal packets are more robust.

Would anyone like to assure me that either there can never be a corrupted SPI transaction when configuring the RX or that the stack is resilient to such failures?
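For what it's worth, one way I could imagine detecting a corrupted configuration write is to read each register back after writing it. The following is only a rough sketch of that idea, not code from the Mbed OS driver: RegisterBus and write_verified are names I've made up, and the two hooks stand in for whatever SPI read/write routines the driver really uses. It would only apply to registers that read back the value written, so not the FIFO or write-to-clear flag registers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical register access hooks (not part of any real driver API).
// On target these would wrap the driver's SPI register read/write routines.
struct RegisterBus {
    std::function<void(uint8_t addr, uint8_t value)> write;
    std::function<uint8_t(uint8_t addr)> read;
};

// Write a register, read it back, and retry a limited number of times.
// Returns true as soon as the read-back matches the value written,
// false if every attempt produced a mismatch.
bool write_verified(const RegisterBus &bus, uint8_t addr, uint8_t value,
                    int max_attempts = 3)
{
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        bus.write(addr, value);
        if (bus.read(addr) == value) {
            return true;
        }
    }
    return false;
}
```

A false return would give the stack something concrete to react to, for example by re-initialising the radio in the same way the TX timeout path does.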

At present my feeling is that the “software RX timeouts” should be put back, or that an RX error mechanism should be created.
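To make that concrete, here is a minimal sketch of the kind of software guard I have in mind. It is not the code that was removed and not taken from the Mbed OS stack: RxWatchdog and its methods are hypothetical names, and time is passed in explicitly so the logic can be exercised off-target. The assumption is that the guard is armed when the radio enters RX, disarmed from the DIO0/DIO1 handlers, and given a period somewhat longer than the hardware RX window so that it only fires when neither interrupt arrives.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical software guard for an RX window (illustrative names only).
class RxWatchdog {
public:
    // on_stall is invoked once when the guard expires, e.g. to re-initialise
    // the radio and report an RX error to the upper layers.
    explicit RxWatchdog(std::function<void()> on_stall)
        : _on_stall(std::move(on_stall)) {}

    // Call when the radio is put into RX.
    void arm(uint32_t now_ms, uint32_t guard_ms)
    {
        assert(guard_ms > 0);
        _start_ms = now_ms;
        _guard_ms = guard_ms;
        _armed = true;
    }

    // Call from the interrupt handlers that normally end the RX window.
    void disarm() { _armed = false; }

    // Call periodically. Returns true if a stall was detected on this call.
    bool poll(uint32_t now_ms)
    {
        if (!_armed) {
            return false;
        }
        // Unsigned subtraction stays correct across a millisecond-counter wrap.
        if (static_cast<uint32_t>(now_ms - _start_ms) < _guard_ms) {
            return false;
        }
        _armed = false;
        _on_stall();
        return true;
    }

private:
    std::function<void()> _on_stall;
    uint32_t _start_ms = 0;
    uint32_t _guard_ms = 0;
    bool _armed = false;
};
```

In normal operation the guard would never fire, because one of the two interrupt handlers disarms it first. It only matters in the case I suspect we're hitting, where the radio was never properly put into RX and no interrupt ever comes.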

If anyone has any other ideas, I’m more than willing to hear them. :)

Regards,
Matt