The main thing here is a server with a fast, stable Internet connection, because it has to handle the video streams from many users at the same time.
The app is the client of this server: it "just" provides pictures and sound, packed in some format(s), as a stream over some Internet protocol (HTTP, TCP, UDP...). In the other direction, it receives and unpacks the incoming stream from the server.
Of course, all of this happens only after login and authentication of all chat users...
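To make "packed in some format" a bit more concrete, here is a minimal sketch of one possible framing for a byte stream such as TCP. The header layout (1-byte media kind plus 4-byte length) and the names `pack_frame` and `unpack_frames` are my own assumptions for illustration, not part of any real protocol.

```python
import struct

# Hypothetical header: 1-byte media kind, then 4-byte big-endian payload length.
HEADER = struct.Struct("!BI")
KIND_VIDEO, KIND_AUDIO = 1, 2


def pack_frame(kind: int, payload: bytes) -> bytes:
    """Prefix an already encoded picture or sound chunk with the header."""
    return HEADER.pack(kind, len(payload)) + payload


def unpack_frames(buffer: bytes):
    """Split received bytes into complete (kind, payload) frames.

    Returns the frames plus the unconsumed tail, because a stream
    socket can deliver a frame in several pieces.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        kind, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((kind, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

The receiver keeps the returned tail and prepends it to the next chunk read from the socket. With UDP you would not need the length prefix, but you would have to deal with lost and reordered packets instead.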
The server has to buffer the video and sound from each user somehow, mix the sounds together, and stream the result to everyone in the chat.
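The sound-combining step could look roughly like this, assuming every user sends signed 16-bit PCM at the same sample rate. The function names are hypothetical, and leaving each user's own voice out of their mix (so they do not hear an echo of themselves) is my assumption about what you would want, not a requirement.

```python
from array import array


def mix_pcm16(chunks):
    """Sum 16-bit PCM chunks sample by sample, clamping to the int16 range."""
    if not chunks:
        return array("h")
    length = min(len(chunk) for chunk in chunks)
    mixed = array("h", [0] * length)
    for i in range(length):
        total = sum(chunk[i] for chunk in chunks)
        mixed[i] = max(-32768, min(32767, total))  # avoid integer wrap-around
    return mixed


def mix_for_each_user(buffers):
    """Build one mix per user from everybody else's audio.

    buffers maps a user name to that user's latest chunk (array of type 'h').
    """
    return {
        user: mix_pcm16([chunk for other, chunk in buffers.items() if other != user])
        for user in buffers
    }
```

A pure Python loop like this is only for showing the idea. A real server would do the mixing in native code and would also have to resample and align streams that arrive with different delays.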
Look into WebRTC and other protocols.
Grabbing the client's screen for sharing is another special pain, and it is different under each operating system...
The most common part is a server database and an interface for user registration/login, listing the users, selecting them, entering a chat, and the text chat itself.
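For the registration/login part, a minimal sketch with SQLite could look like the following. The table layout, the function names and the PBKDF2 iteration count are all assumptions for illustration; a real service would add sessions, rate limiting and so on.

```python
import hashlib
import hmac
import os
import sqlite3


def open_db(path=":memory:"):
    """Open the (hypothetical) user database and create the table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "name TEXT PRIMARY KEY, salt BLOB NOT NULL, pw_hash BLOB NOT NULL)"
    )
    return db


def _hash(password, salt):
    # Store a salted, slow hash; never store the password itself.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)


def register(db, name, password):
    """Return True on success, False if the name is already taken."""
    salt = os.urandom(16)
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO users VALUES (?, ?, ?)",
                (name, salt, _hash(password, salt)),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def login(db, name, password):
    row = db.execute(
        "SELECT salt, pw_hash FROM users WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False
    salt, stored = row
    return hmac.compare_digest(stored, _hash(password, salt))


def list_users(db):
    return [r[0] for r in db.execute("SELECT name FROM users ORDER BY name")]
```

Usage would be something like `db = open_db("chat.db")`, then `register(db, "alice", "s3cret")` and `login(db, "alice", "s3cret")` from the server's request handlers.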