Memory leak bug - Mosquitto client says invalid argument for loop stop

(I have replaced the contents of the original question. The text below was originally posted on SO, where it was downvoted and nominated for closure, which suggests that SO is the wrong place for it.)

I am using OpenWrt on ARMv5. Valgrind is not supported on the platform, but I have gdb. The application uses the Mosquitto client library 2.0.15 with the matching OpenSSL for secure connections.

The (relatively complex) application creates a Mosquitto instance and works with it through the library's internal loop (a dedicated thread that Mosquitto runs to support its operations). When the application terminates, everything shuts down properly. There is also a mode, however, in which it has to disconnect from the broker, stop the loop, destroy the instance, perform the library cleanup, and then create a new Mosquitto instance and loop and connect again, more or less from scratch.

The problem is that memory leaks every time this restart happens. I am fairly certain it is related to the Mosquitto loop, but I need to prove it and find out what is keeping the memory from being released.

Bear with me if I am asking stupid or inappropriate questions here, as I am not an expert in these matters.

I can see the process memory grow using /proc/PID/maps and gdb’s maintenance info mappings, but neither shows who allocated the memory blocks or for what purpose. It would help if I could tell the system, at thread creation, to label the memory that thread allocates so that the label shows up in those outputs.
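
One way to put numbers on the growth is to sample /proc/self/status from inside the application before and after each restart cycle. The helper below is a minimal sketch, assuming a Linux /proc filesystem; the function name proc_status_value is made up for illustration. It does not say who allocated the memory, only whether fields such as VmSize, VmRSS or Threads change across a cycle.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns the numeric value of a field in
 * /proc/self/status (for example "VmSize", "VmRSS" or "Threads"),
 * or -1 if the file cannot be read or the field is missing.
 * Memory fields are reported by the kernel in kB.
 */
long proc_status_value(const char *field)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    size_t len = strlen(field);
    long value = -1;

    if (!f) return -1;

    while (fgets(line, sizeof line, f)) {
        /* Match "<field>:" at the start of the line. */
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            if (sscanf(line + len + 1, "%ld", &value) != 1) {
                value = -1;
            }
            break;
        }
    }
    fclose(f);
    return value;
}
```

Logging proc_status_value("VmSize") right before the disconnect and right after the reconnect, over several cycles, should show whether the growth per cycle is constant, which would point at a fixed-size resource such as a thread stack.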

Back to the issue. This memory leak does not appear on another platform with Mosquitto 2.0.18 (and certainly a newer version of OpenSSL). The problem may be deeper than I think. One more clue: when stopping, I call disconnect, and when it returns I call loop_stop, then destroy, and then the library cleanup. In this chain, loop_stop is the only call that fails, with the error “invalid arguments provided”, although I pass the same *mosq instance as in the other calls. That is why I strongly suspect this is the issue (it probably returns the error before it gets to joining the thread).

I am not sure that blindly digging into the Mosquitto source code, and then probably OpenSSL (which the Mosquitto library uses), is a good idea.

What steps, from your expert point of view, can I take to understand what actually causes the problem? Maybe look at open file handles (how)? Maybe somehow see which resources the Mosquitto thread that did not die is still holding (per thread, how?), and then try to figure out why?

Update: digging into the libmosquitto sources turned out to be neither as useless nor as hard an exercise as I expected.

Update: it appears that mosquitto_loop_forever within mosquitto__thread_main exits before I call loop_stop and marks the thread as dead (mosq_ts_none). When I then call loop_stop, it aborts because it thinks the thread is already dead, and does not cancel it. Therefore pthread_join is never called and the thread's resources are not released.

What I can’t understand is why the thread ends with return instead of pthread_exit.
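
As far as POSIX is concerned, returning from the thread's start routine has the same effect as calling pthread_exit with the returned value, so the return by itself should not be the leak. What matters is that a joinable thread keeps its resources until someone calls pthread_join on it, even if it finished long ago. The sketch below is a standalone illustration with made-up names (worker, join_finished_thread), not libmosquitto code.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body: returning arg is equivalent to pthread_exit(arg). */
static void *worker(void *arg)
{
    return arg;
}

/*
 * Starts a joinable thread and joins it afterwards.
 * The thread may well have finished before pthread_join runs;
 * the join is still valid, still delivers the return value,
 * and is what releases the thread's resources.
 * Returns 0 on success, -1 on any failure.
 */
int join_finished_thread(void)
{
    pthread_t tid;
    int token = 42;
    void *result = NULL;

    if (pthread_create(&tid, NULL, worker, &token) != 0) return -1;
    if (pthread_join(tid, &result) != 0) return -1;

    return (result == &token) ? 0 : -1;
}
```

If the join is skipped for a joinable thread, its resources stay with the process for as long as the process lives, which matches a leak that grows by a similar amount on every restart.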

There seems to be a flaw in the threaded logic:
mosquitto__thread_main calls mosquitto_loop_forever, which calls mosquitto_loop. When mosquitto_loop returns a non-success value, mosquitto_loop_forever moves on to reconnecting. However, in the reconnection branch it checks whether a disconnect has been requested, and if so it simply exits. When it exits, the loop thread exits too and marks itself as dead, but the main thread never joins it, so its resources stay allocated like a zombie.

int mosquitto_loop_forever(struct mosquitto *mosq, int timeout, int max_packets)
{
	int run = 1;
	int rc = MOSQ_ERR_SUCCESS;
	unsigned long reconnect_delay;

	if(!mosq) return MOSQ_ERR_INVAL;

	mosq->reconnects = 0;

	while(run){
		do{
#ifdef HAVE_PTHREAD_CANCEL
			pthread_testcancel();
#endif
			rc = mosquitto_loop(mosq, timeout, max_packets);
		}while(run && rc == MOSQ_ERR_SUCCESS);
		/* Quit after fatal errors. */
		switch(rc){
			case MOSQ_ERR_NOMEM:
			case MOSQ_ERR_PROTOCOL:
			case MOSQ_ERR_INVAL:
			case MOSQ_ERR_NOT_FOUND:
			case MOSQ_ERR_TLS:
			case MOSQ_ERR_PAYLOAD_SIZE:
			case MOSQ_ERR_NOT_SUPPORTED:
			case MOSQ_ERR_AUTH:
			case MOSQ_ERR_ACL_DENIED:
			case MOSQ_ERR_UNKNOWN:
			case MOSQ_ERR_EAI:
			case MOSQ_ERR_PROXY:
				return rc;
			case MOSQ_ERR_ERRNO:
				break;
		}
		if(errno == EPROTO){
			return rc;
		}
		do{
#ifdef HAVE_PTHREAD_CANCEL
			pthread_testcancel();
#endif
			rc = MOSQ_ERR_SUCCESS;
			if(mosquitto__get_request_disconnect(mosq)){
				run = 0;
			}else{
				if(mosq->reconnect_delay_max > mosq->reconnect_delay){
					if(mosq->reconnect_exponential_backoff){
						reconnect_delay = mosq->reconnect_delay*(mosq->reconnects+1)*(mosq->reconnects+1);
					}else{
						reconnect_delay = mosq->reconnect_delay*(mosq->reconnects+1);
					}
				}else{
					reconnect_delay = mosq->reconnect_delay;
				}

				if(reconnect_delay > mosq->reconnect_delay_max){
					reconnect_delay = mosq->reconnect_delay_max;
				}else{
					mosq->reconnects++;
				}

				rc = interruptible_sleep(mosq, (time_t)reconnect_delay);
				if(rc) return rc;

				if(mosquitto__get_request_disconnect(mosq)){
					run = 0;
				}else{
					rc = mosquitto_reconnect(mosq);
				}
			}
		}while(run && rc != MOSQ_ERR_SUCCESS);
	}
	return rc;
}
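
The sequence described in the updates can be reproduced in a toy model. Everything below is a simplified stand-in with made-up names (toy_client, toy_loop_start, toy_loop_stop, toy_demo), written from the behaviour described in this post rather than copied from libmosquitto: the thread clears its own state flag on exit, and the stop function refuses to join once the flag is cleared.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical thread states, loosely modelled on mosq_ts_none / mosq_ts_self. */
enum { TS_NONE, TS_SELF };

struct toy_client {
    pthread_t thread_id;
    atomic_int threaded;
};

static void *toy_thread_main(void *obj)
{
    struct toy_client *c = obj;
    /* The network loop would run here and return after a disconnect request. */
    atomic_store(&c->threaded, TS_NONE);   /* thread marks itself as dead */
    return obj;
}

static int toy_loop_start(struct toy_client *c)
{
    atomic_store(&c->threaded, TS_SELF);
    return pthread_create(&c->thread_id, NULL, toy_thread_main, c);
}

/* Returns -1 ("invalid argument") without joining if the flag is already cleared. */
static int toy_loop_stop(struct toy_client *c)
{
    if (atomic_load(&c->threaded) != TS_SELF) return -1;
    pthread_join(c->thread_id, NULL);
    atomic_store(&c->threaded, TS_NONE);
    return 0;
}

/* Runs the failing order of events and returns what toy_loop_stop reported. */
int toy_demo(void)
{
    struct toy_client c;
    int rc;

    if (toy_loop_start(&c) != 0) return -2;

    /* Wait until the thread has finished on its own, as after a disconnect. */
    while (atomic_load(&c.threaded) == TS_SELF) sched_yield();

    rc = toy_loop_stop(&c);            /* fails: flag already cleared */

    /* The model can still join here; an application has no such handle into the library. */
    pthread_join(c.thread_id, NULL);
    return rc;
}
```

In this model the stop call fails exactly when the thread wins the race and exits first, and the only thing that frees the thread is the extra join at the end of toy_demo. Whether the real library behaves the same way is what the listing above suggests, not something the model proves.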

To clarify: I first disconnect, then call loop_stop with force set to true. If I do not set force to true, loop_stop sometimes hangs. I settled on this setting empirically some time ago. In the leaking case, execution does not even reach the point where force is evaluated, so the setting makes no difference there.
I can’t upgrade to 2.0.18 on the current device: 2.0.15 was the latest version supported for the (relatively recent) OpenWrt release at the time I installed it. I recall trying newer versions of Mosquitto as a test, and there were problems with them (they either did not compile or did not work properly).

Maybe there’s some issue in my logic?