Random issues in Sip Profile (suspect squlite DB corruption) #2296

Open
kusznir opened this issue Oct 30, 2023 · 20 comments
Labels
bug Something isn't working

Comments

@kusznir

kusznir commented Oct 30, 2023

With 1.10.10, after running for a few days, one of my SIP profiles develops corruption. It first shows up as BLF "sticking" on SOME (and only some, not all) domains on that profile: BLF correctly updates to a ringing state with the caller ID, and correctly stops blinking (ringing), but then stays stuck on "in use" indefinitely. The problem escalates until that profile suddenly stops responding to SIP requests (INVITE, OPTIONS, etc.), which causes all devices on that profile to eventually show as unregistered and immediately stops call processing. This only affects one profile (I have 4 running on this server).

So far, it appears to be random. It has occurred on two servers (both in a BDC cluster), and does not show up immediately. Stopping freeswitch, deleting the sqlite dbs, and restarting it will fix it for a while, but the problem will eventually re-occur. No specific triggers found thus far.

I am not a programmer, so I have no idea how to do any kind of debug traces. The servers in question are in production with a total of about 700 extensions subscribed, so any issues affecting performance have to be corrected immediately (restart, clear, etc); time taken to further troubleshoot is unacceptable, as it prolongs the service outage.

We have not been able to reproduce this on a lab server; only production servers carrying production traffic volumes are affected.

kusznir added the bug label Oct 30, 2023
@kusznir
Author

kusznir commented Oct 30, 2023

One more note: if we move an affected extension (one with BLF sticking) to a different SIP profile, the problem clears on that extension immediately, but this does NOT clear the BLF issue for any other extensions on that domain.

@andywolk
Contributor

See if that PR helps freeswitch/sofia-sip#233
Also see this discussion #2283

@themsley-voiceflex

I think you would need to be using WSS on at least one profile to hit the bug in #2283. @kusznir does that fit your use case?

@kusznir
Author

kusznir commented Oct 31, 2023

I do not believe the referenced bug aligns with my case. My SIP Profile has "wss-binding" disabled and I have never used it. Also, Freeswitch does not crash, and when the bug occurs, its effect is limited to a single sip profile; all other sip profiles are functioning as usual. So far, this bug has only occurred on the sip profile "internal" (5060/5061 TLS), but has happened multiple times.

@kusznir
Author

kusznir commented Oct 31, 2023

Sorry, I read more details on the PR, and it is unclear whether it fixes only file corruption or also other corruption that could be limited to a single profile. We do use TLS for almost all of our SIP registrations. That said, we had about 340 devices each on two different profiles, and only one profile was affected. At no point did freeswitch as a whole go down or crash; everything was contained to a single profile.

@themsley-voiceflex

The bug seems to be within libsofia's tport code. I am unsure whether it affects TLS connections as well as WSS, but the symptoms would include random file overwrites and also random sockets being closed when they should not be. I suspect that could show up as a wide variety of symptoms, but I'm not sure the ones you describe fit the bill. I suspect not, but I am no expert on this, just someone who had that issue to the point where our service was unusable until the recent PR 233 patches.

@kusznir
Author

kusznir commented Jul 8, 2024

This issue was NOT resolved by the above referenced PR. As of Freeswitch 1.10.11, the issue remains.

@jseifeddine

jseifeddine commented Oct 15, 2024

Can report same issues with internal profile, random.

Stopping freeswitch and deleting /var/lib/freeswitch/db and then starting seems to fix it temporarily.
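
The workaround described above can be sketched as a small shell helper. This is only a sketch under stated assumptions, not a fix: it moves the database files aside rather than deleting them, so they stay available for later inspection. The function name `set_aside_dbs`, the backup location, and the `freeswitch` systemd unit name in the usage comment are assumptions; only the /var/lib/freeswitch/db path comes from this thread.

```shell
# Move SQLite database files out of the db directory into a backup directory.
# Hypothetical helper; run it only while FreeSWITCH is stopped.
#
# Assumed usage (the unit name "freeswitch" is an assumption):
#   systemctl stop freeswitch
#   set_aside_dbs /var/lib/freeswitch/db /var/tmp/fs-db-backup
#   systemctl start freeswitch
set_aside_dbs() {
    db_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    # The glob also catches journal files and oddly named copies such as 'core.db '.
    for f in "$db_dir"/*.db*; do
        [ -e "$f" ] || continue          # glob matched nothing
        mv -- "$f" "$backup_dir"/ || return 1
    done
}
```

Keeping the old files means a stuck profile's database can still be examined after service has been restored.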

I've actually now moved the db into postgres instead of sqlite, and the problem remains.
(This is done with the odbc-dsn parameter on the internal sip profile.)

Edit:

Can't intercept either...

2024-10-15 15:51:49.862145 96.07% [ERR] switch_core_sqldb.c:1322 SQL ERR: [SELECT uuid, call_uuid, hostname FROM channels WHERE callstate IN ('RINGING', 'EARLY') AND (1 <> 1 OR direction = 'outbound' ) AND (1<>1 OR presence_id = '[email protected]' ) ORDER BY created_epoch DESC LIMIT 1 ] no such table: channels
freeswitch     1.10.12-release-10222002881-a88d069d6f~bookworm amd64
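
The error above reports a missing channels table. As a rough aid for checking how often this happens, here is a minimal sketch that scans a log file for such errors. The function names are hypothetical, and the log path in the usage comment is an assumption that may not match a given install.

```shell
# Count log lines that report an SQL error caused by a missing table.
# Assumed usage: count_missing_table_errors /var/log/freeswitch/freeswitch.log
count_missing_table_errors() {
    grep -c 'SQL ERR:.*no such table' "$1"
}

# List the tables reported as missing, with a count per table.
missing_tables() {
    grep -o 'no such table: [A-Za-z_][A-Za-z0-9_]*' "$1" | sort | uniq -c
}
```

A count that grows between restarts would suggest the table goes missing repeatedly rather than once.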

@themsley-voiceflex

What version of sofia-sip is in use?

@kusznir
Author

kusznir commented Oct 16, 2024 via email

@themsley-voiceflex

I don't run f/s on Debian but a quick google for sofia sip debian package name told me the packages would be called something like libsofia-sip* - https://packages.debian.org/source/stable/sofia-sip

There were fixes to that package between 1.13.15 and 1.13.17 that stopped it from randomly writing to open filehandles that could have been pointing at anything. I saw it attempt (and fail) to write to /etc/passwd and to its own core.db file among many others.

@jseifeddine

jseifeddine commented Oct 16, 2024

Another weird thing since upgrading to 1.10.12: there is now a second core.db file whose name has a space at the end:

ls -lh /var/lib/freeswitch/db/|grep core
-rw-r----- 1 www-data www-data 488K Oct 17 09:52  core.db
-rw-r----- 1 www-data www-data  16K Oct 15 23:45 'core.db '
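
A stray file like 'core.db ' is easy to miss in a plain listing. The following sketch prints files whose names start or end with whitespace; the function name `find_whitespace_names` is hypothetical, and it relies on `find` supporting `-maxdepth`, as the GNU and BSD versions do.

```shell
# Print regular files directly inside a directory whose names
# begin or end with a whitespace character.
# Usage: find_whitespace_names /var/lib/freeswitch/db
find_whitespace_names() {
    find "$1" -maxdepth 1 -type f \( -name '[[:space:]]*' -o -name '*[[:space:]]' \)
}
```

Piping the output through `cat -A` (GNU) makes the trailing space visible as a character before the end-of-line marker.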

@jseifeddine

jseifeddine commented Oct 16, 2024

> What version of sofia-sip is in use?

ii libsofia-sip-ua0 1.13.17-11077684608-9e5c40ed11~bookworm amd64 Sofia-SIP library runtime

@jseifeddine

jseifeddine commented Oct 16, 2024

> Just an FYI: I continue to experience the same problem after several FreeSWITCH updates. I haven't found a way to determine what version of Sofia is in use; can you provide a means of retrieving that from a running instance?

You can just check which version is installed on the system.
On Debian-based systems you could run:

dpkg -l |grep sofia
ii  freeswitch-mod-sofia                   1.10.12-release-10222002881-a88d069d6f~bookworm amd64        mod_sofia for FreeSWITCH
ii  freeswitch-mod-sofia-dbg               1.10.12-release-10222002881-a88d069d6f~bookworm amd64        mod_sofia for FreeSWITCH (debug)
ii  libsofia-sip-ua0                       1.13.17-11077684608-9e5c40ed11~bookworm         amd64        Sofia-SIP library runtime

On yum-based systems, try rpm -qa | grep sofia

@kusznir
Author

kusznir commented Oct 17, 2024 via email

@jseifeddine

@kusznir you could do the following:

Find the library file, for example with ldconfig -p | grep sofia-sip

	libsofia-sip-ua.so.0 (libc6,x86-64) => /lib/libsofia-sip-ua.so.0

Then find the version string inside the library file with strings /lib/libsofia-sip-ua.so.0 | grep sofia-sip-

libsofia-sip-ua.so.0
sofia-sip-1.13.17
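
The two steps above can be combined into one helper. This is a sketch, assuming `ldconfig`, `awk`, and `strings` (from binutils) are available; the function names are hypothetical.

```shell
# Extract Sofia-SIP version strings (e.g. sofia-sip-1.13.17) from text on stdin.
sofia_version_from_strings() {
    grep -o 'sofia-sip-[0-9][0-9.]*' | sort -u
}

# Locate the library through the dynamic linker cache and print
# the version string embedded in it.
sofia_lib_version() {
    # The last field of a matching "ldconfig -p" line is the library path.
    lib=$(ldconfig -p | awk '/libsofia-sip-ua\.so/ { print $NF; exit }')
    [ -n "$lib" ] || { echo "libsofia-sip-ua not found in linker cache" >&2; return 1; }
    strings "$lib" | sofia_version_from_strings
}
```

This reports the library found through the linker cache, which may differ from the one a long-running process loaded before an upgrade.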

@jseifeddine

@kusznir

see #2619

Seems like it's fixed but not yet released. I built from source and am now running 1.10.13-dev git 97cb672

I will report back, but we were having the exact same issues: stuck BLF, unable to page, etc.

I think the root cause is that the call channels stay open even after a call is finished.

Let's see...

@Amrrx

Amrrx commented Oct 27, 2024

> @kusznir
>
> see #2619
>
> Seems like it's fixed but not yet released. I built from source and am now running 1.10.13-dev git 97cb672
>
> I will report back, but we were having the exact same issues: stuck BLF, unable to page, etc.
>
> I think the root cause is that the call channels stay open even after a call is finished.
>
> Let's see...

Any update on your test?

@jseifeddine

> > @kusznir
> >
> > see #2619
> >
> > Seems like it's fixed but not yet released. I built from source and am now running 1.10.13-dev git 97cb672
> >
> > I will report back, but we were having the exact same issues: stuck BLF, unable to page, etc.
> >
> > I think the root cause is that the call channels stay open even after a call is finished.
> >
> > Let's see...
>
> Any update on your test?

Yes - it did fix it.

No longer having the problem.

@kusznir
Author

kusznir commented Oct 27, 2024 via email

5 participants