Random OTAA join requests caused by WDT firing

Hi there,

we recently put to desktop test 10 of our SodaqOne-V2 devices and observed that they randomly attempt OTAA joins. The devices run the smae latest version fo the software.

Initially the test was done indoors with the devices powered by USB. Later the test was run again under direct sunlight on a bright day.

Below you can see a snapshot of a graph of the second test which shows the LoRaWAN message counter being reset which is -or should be- an indicator of the devices rejoining. There are four gateways in range and the reactivation was confirmed by viewing the received messages log.

Next you can see a snapshot of the indoors test, however there is data from only 3 devices

Is this normal behaviour?

Since this is quite random (different devices rejoin every time they are reset) we still don’t have a clear idea about why this is happening but we would really appreciate any input as where to start debugging this.

Greetings,
Dimitris

PS: Since there are different repositories for the different versions of the Universal Tracker software, I thought that this should better be in the forum rather than in an old version repo issue. :slight_smile: In any case you can always reply in github

@dmamalis

It’s normal that indoors the signal is weaker and you need to switch to a higher SF to get your messages delivered at the gateway.

Did you modify the universal tracker software?

Do you use acknowledgements?
If yes, how many times do you retry?

Do you know what SF was used before the reset?

Do you use adaptive datarate (adr)?

The RN module has 3 dutycycle timers, if one of these run out the module resets and does a reconnect.
This happens when you try to use too much bandwidth.

Kind regards,
Jan

Good morning Jan,

the only modified part of the software is to defiine default values to the parameters exposed to the command line UI, thus keys and:

  • fi=1
  • gft=50
  • retry=1
  • dbg=1
  • led=1

Given that the software flow is intact, I guess you have all the answers to your questions. :slight_smile:
The devices are placed outdoors in clear sky view to minimize gft and maximize solar power.

I really really doubt that with 4 gateways in range, some of which in the same building, you need a different SF especially when with the same setup we have achieved 54km range with SodaqOnes :thumbsup:

Later today, I will redo the test trying to grab some gateway information regarding SF and RSSI. In the mean time, could you elaborate on the timers? That will help us debug the problem.

Has any of the timers anything to do with gft? I haven’t really put an eye on it but our data shows that the gps does send correct data.

Greetings,
Dimitris

Hi there,

I am back with a bit more data.

Below you can see a snapshot of a device rejoining. Again the test was conducted outdoors at direct sunlight and in proximity to 4 gateways.
The line below zero is obviously the RSSI which is pretty stable. Along with the SNR plot one can see that the signal quality is more than adequate.
The straight line is SF and it is 7. So, @Jan, to answer your question the SF is 7 when resetting.

We also plotted the GPS time to fix to make sure that there was nothing strange going on there. The GPS seems to fix pretty quickly.

Greeting,
Dimitris

Hi @dmamalis

You should be able to send every minute on SF7 although the technology is not really ment to do so.

Do you have the latest RN firmware? We recently published 1.0.4 for Europe.

Do you see the same behaviour when you extend the fi?

Regards,
Jan

I think the airtime of the GPS tracker message at SF7 fits nicely in the 1% rule. After all the RN2483 stack takes good care of that in order to be compliant with the ISM band rules. Even if transmitting too much was the problem, the LoRa stack would just hold back the excess messages. It would definitely not reJoin or lose the FcntUp.

I will attempt with a different fi and come back at this. So far, during the last two days only 3 out of 10 devices have made it to 2K messages

The devices where purchased about 2 months ago. We haven’t changed the firmware. I will do that if the longer fi test has the same problem.

In the mean time, you have talked about three timers that could be causing this behaviour. Could you elaborate on that please?

Goedemorgen,

we run the software twice more, once with ADR off and once with ADR off and FI=3.
Unfortunately, the results are the same and I am really wondering how this hasn’t appeared before in the forum. I will not post an image since it looks exactly the same: at random time, some of the devices rejoin.

This might seem like a minor issue, however, both our deployment scenarios - for which we purchased the last dozen of devices - are heavily affected by this. The SodaqOnes are to be deployed in two different types of landscape: very hilly and coastal water.

The devices will register in a controlled environment (adequate reception) and then will go mobile. Devices WILL get out of range but, as we have seen from our tests with other LoRa devices, a registered device which transmits while it moves on the edge of the coverage has significant chances to be received in such a landscape (frequent changes in altitude, flat rock surroundings, very low vegetation, no vegetation at all, water reflections, reflections on passibng by cargo boats etc). We need those chances to be used to transmit data and not reattempt a Join request which -under low quality signal conditions- will most probably fail.

Of course, there is ABP but having to use ABP for our deployment due to this is considered a downside.

Has anyone really tried to reproduce that?

Greetings,
DImitris

It looks like the devices rejoin due to a watchdog alarm.
This is the printout of the even caught on one of the ten test devices with the latest Firmware.

Starting getGpsFixAndTransmit()...
Calling setGpsActive()
** SodaqOne Universal Tracker v2 - 5.0.2 **
 -> CPU reset by Watchdog [32]

The Watchdog seems to fire while the code is in the setGpsActive routine. In there there are a couple of sodaq_wdt_safe_delay calls. How have these been calculated? Could this be what causes the watchdog to fire?

The Watch Dog Timer (WDT) is set to have an early warning period of 8s in the tracker software.

What that means is that unless the WDT timer is reset, the early warning interrupt will fire after that 8s period. After which, if the timer still does not get reset, the board will be reset by the WDT after an additional 8s.

The sodaq_wdt_safe_delay method regularly resets the WDT timer to ensure that a delay does not reset the board. Something in the GPS function must be blocking for at least 16s, which is resulting in the board being reset.

Thank you Gabriel. This is understood as it is normal operation of a WDT. Do you in Sodaq, or other users of ONE’s, have any ideas about what could be blocking for such a long time.

How has the 8s (16 in this case) been calculated and decided?

Since I haven’t figured out what triggers this I can’t be 100% sure for the following statement but it looks like the problem appears in setGpsActive only when it is called from the getGpsFixandTransmit function.

Moreover, all the times I have captured this, none of the debug prints of the setGPSActive gets printed out. I am just leaving this as a comment in order to help pin down where the problem might be.

I’m not sure what is causing this issue. Perhaps a few more calls to sodaq_wdt_reset()need to be added.

I’m not sure why the 8s interval was selected, however, it is the longest possible interval for the WDT library (the interval has a range of 1/64s -> 8s).

Could you provide a few more of the previous lines of that debug output, it would help to narrow down the specific series of calls made prior to the WDT reset.

You could also try removing the WDT initialisation sodaq_wdt_enable(WDT_PERIOD_8X); from setup(). If the cause is simply the WDT being tripped, then you should not see the rejoins. If something else is causing the board to lock up, then you will the board will stop transmitting.

Thank you Gabriel.

I have now removed the sodaq_wdt_enable(WDT_PERIOD_8X); from setup but I guess that was not enough.

Calling sodaq_wdt_reset()
Returned from sodaq_wdt_reset()
Calling sodaq_wdt_reset()
Entered rtcAlarmHandler()
Exiting rtcAlarmHandler()
Returned from sodaq_wdt_reset()
Entered runDefaultFixEvent()
Entered isAlternativeFixEventApplicable()
Exiting isAlternativeFixEventApplicable()
Default fix event started.
Entered getGpsFixAndTransmit()
Starting getGpsFixAndTransmit()...
Calling setGpsActive()
Calling sodaq_wdt_reset()
Returned from sodaq_wdt_reset()
Calling enable()
Calling flush()
while
+-----------------------------+
 Cannot open /dev/ttyACM0!
Returned from sodaq_wdt_reset()
Calling sodaq_wdt_reset()
Returned from sodaq_wdt_reset()
Calling sodaq_wdt_reset() 
Returned from sodaq_wdt_reset()   
** SodaqOne Universal Tracker v2 - 5.0.2 **
LoRa HWEUI: 0004A30B001F2264
-> CPU reset by Watchdog [32]

These are my debug prints. Before that there are literally hundreds of lines of calling and exiting sodaq_wdt_reset().
You can see that right before the /dev/ttyACM0 get’s lost there is a “while”. This is the first while loop sitting in the getGpsFixAndTransmit() function.

PS: debug prints is porbably not he best way to track this but that’s all I got now.

There are actually two places where the WDT is enabled in setup(), once at the beginning and then a second time after it is disabled to handled the boot up commands.

In v2 of the tracker software these are found on lines 174 & 204.

Thank you Gabriel. I will look into that too.
An interesting bird’s eye view observation is that it looks like the devices “degrade” towards the problem, meaning that once it occurs, it will occur in a slightly steady fashion. If you take a look on the second image of the first post you will see that some of the devices were rejoining right after the 8th message was transmitted. This is the reason why we where actually looking for the problem on the LoRa side rather than on the WDT.

Today, I will retry the tests with the WDT disabled and hopefully come back with more info. It starts looking like a ublox hang.

Greetings,
Dimitris

Removing/disabling the WDT will help determine if the cause is due to an accidental tripping of the WDT.

I imagine though that the root cause is something else.

Dear Gabriel,
I think the problem comes from this while loop
Such while loops are all over the Ublox code. Using the WDT to escape them is ok if you are not on LoRaWAN OTA but since this is a LoRaWAN device I think that there should be another way to escape them. Maybe a timeout and retry or sth like that.

What do you think?

Since this seems to be coming from the Ublox code, will it affect other versions of the ONE?

You could try adding a call to sodaq_wdt_reset() in the outer loop there (330-341 block). The inner loop (337-340), should be fine as this is called a maximum of 128 times.

Hi Gabriel,

unfortunately I am currently off base from our testing set up. I will try this as soon as possible but I am still wondering if you have managed to replicate this. If not, should I assume that our devices are faulty?

Sorry I haven’t been able to test this, however, I believe there might be a bug in the section you point out.

I believe the issue is that the data should be read in one operation, however, the current code reads chunks up to 128 bytes. In cases where there are more than 128 bytes available, the second read may cause a timeout. Specifically, the issue is that a stop bit is sent after reading the first 128 bytes.