UNICODE Text to String #417022
02/07/13 12:23
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Hello, I hope someone can help me.
I want to read text from a Unicode text file into a string (this is important, because special characters are needed).
It's not working, and I don't know much about this topic.
Can anyone help me with this? Thank you.

Here is a demo showing the problem:
Example (attachment; on the linked page, click the smaller download button)

And here is the code separately:

STRING* str1 = "";
STRING* str2 = "";

function main()
{
    var file = file_open_read("file.txt");

    file_str_readtow(file, str1, " ", 200);
    file_str_readtow(file, str2, " ", 200);

    file_close(file);

    printf(_chr(str1));
    printf(_chr(str2));
}

Last edited by Benni003; 02/07/13 12:28.
Re: UNICODE Text to String [Re: Benni003] #417025
02/07/13 13:24
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
printf doesn't seem to use a Unicode font. The following works, so try this:

#include <acknex.h>
#include <default.c>

STRING* str1 = "#8";
STRING* str2 = "#8";

FONT* fontArial = "Arial#20b"; // TrueType font, 20 point bold characters

TEXT* txtTest =
{
    layer = 999;
    string = str1;
    flags = SHOW;
    font = fontArial;
}

function main()
{
    var file = file_open_read("file.txt");

    file_str_readtow(file, str1, " ", 200);
    file_str_readtow(file, str2, " ", 200);

    file_close(file);

    // printf(_chr(str1));
    // printf(_chr(str2));
}


no science involved
Re: UNICODE Text to String [Re: Benni003] #417031
02/07/13 13:53
Joined: Apr 2007
Posts: 3,751
Canada
WretchedSid Offline
Expert
Yes, finally, character encoding, I was already afraid that this topic would never come up here.

So, fasten your seatbelts and let's talk about character encoding. A string is composed of a number of bits and bytes, just like your average integer; however, the interpretation of a string is a much bigger clusterfuck than the interpretation of an integer (mainly because we agreed on what an integer looks like, we just can't always agree on the byte order). It's a bit like the thousand and one floating point formats out there, except that it's even more fucked up, because usually floating point means IEEE 754 and it just works™.

When you type a string in Lite-C, it's encoded in the so-called ASCII format. Each character is exactly one byte in size and has a value from 0-127, which is enough to encode all English characters, a few punctuation marks, some control characters, etc. (here is the complete list: http://www.asciitable.com/). Now, this obviously poses a problem, because as it happens there are many, many more characters in the world, and people also want to use emojis in their text messages, so having just 128 characters is a bit meh. Luckily we came up with a billion ways to encode all kinds of characters, the most commonly used one being UTF8 (except in some IRC channels on freenode, which will ban you if you use it). UTF8 at its core uses one byte per character and has the same first 128 characters as ASCII, so an ASCII string is also a valid UTF8 string (hooray, someone thought about compatibility). However, a character can also be larger than one byte in UTF8 (up to 4 bytes) and thus represent a character outside of the 128 ASCII characters (except that the string then stops being a valid ASCII string and becomes garbage to an ASCII reader). Oh yeah, by the way, UTF8 is a Unicode encoding. Just like UTF16 and UTF32.

So, what the fuck is Unicode? Unicode, or ISO 10646 if you like numbers, is a standard and not an encoding (a standard created because we had too many competing standards; relevant xkcd: http://xkcd.com/927/). Unicode is basically a list of characters, currently 1,114,112 code points, containing everything from Latin characters, through most of the Asian characters, currency symbols, mathematical symbols and scientific symbols, to the good old smiling pile of poo emoji (code point U+1F4A9, if anyone wants to know).
Each character in the Unicode table has a so-called code point, which basically is just a number, written in hex and prefixed with U+. The ß, for example, is U+00DF, or simply 223 in decimal. The characters in the Unicode list are broken into so-called planes; the first plane covers everything that fits into 2 bytes, is properly called the Basic Multilingual Plane, and contains the most common characters (i.e. the pile of poo is not there). Planes are broken into blocks, and the first block of the Basic Multilingual Plane is the Basic Latin block, and guess which 128 characters those are (that's what happens if you let Americans design shit... after they designed a bazillion other standards on how to write Latin strings).

Let's go back to UTF8 and its friends UTF16 and UTF32. All three encodings are Unicode encodings, meaning that they can be used to represent Unicode strings. The difference in how they encode the Unicode characters is their unit size: UTF8 uses 1-byte units, UTF16 uses 2-byte units and UTF32 uses 4-byte units. UTF32 is the only one of the three that can hold every Unicode character in a single unit, but it also wastes a lot of space, because not every character needs 4 bytes (as already mentioned, the most commonly used English characters fit perfectly into 1 byte). I'm going to skip the UTF8 encoding details for now and get straight to UTF16, which is the encoding Gamestudio uses.

In UTF16 each character has a unit size of 2 bytes, but it can extend up to 4 bytes depending on what you want to encode (the smiling pile of poo, for example, uses 4 bytes). This is the point where it should start to get clear that a) character encoding is a fucked up art coming from a time when people couldn't make/afford RAM or hard drives larger than a few kilobytes, and b) UTF16 doesn't work with normal C strings, because their units are expected to be 1 byte in size while UTF16 units are 2 bytes (so no, the problem is not that printf() uses a non-Unicode font, but that it works with C strings, which have a completely different unit size).

So, how does iterating over a string encoded in UTF16 work? The straightforward approach would be this:
Code:
short *myString = xyz;
while(*myString)
{
     myString ++;
}



Except not. As mentioned before, not every character is 2 bytes in size, some can be larger, so unlike with C strings, making assumptions about the length of a string without looking at its content is bad, and you should feel bad if you do it. The good thing, as already mentioned, is that Unicode wasn't designed by complete lunatics, so the most commonly used characters, including Asian ones, are neatly put into the first 2-byte range of the Unicode table, and if you only use these characters you can assume that each character is 2 bytes and be happy. Except not, because third party input can't be trusted, so let's talk about digesting a UTF16 string.

The easy case is that your character is 2 bytes; this includes everything up to code point U+FFFF.
The hard case is everything else. When code points starting at U+10000 are encoded, UTF16 uses 4 bytes per character, broken into the so-called lead and trail surrogates. This is done by first subtracting 0x10000 from the code point (leaving 20 bits) and then putting the higher 10 bits into the lead surrogate and the lower 10 bits into the trail surrogate, so the lead surrogate is a number between 0xD800 and 0xDBFF and the trail surrogate is a number between 0xDC00 and 0xDFFF.
Easy? Yep, so here is how to write a correct str_length() for UTF16 encoded strings:
Code:
unsigned int str_lengthw(unsigned short *string) // unsigned short: surrogate values don't fit into a signed short
{
    unsigned int length = 0;
    while(*string)
    {
        // Check if this is a 4 byte character
        if(*string >= 0xD800 && *string <= 0xDBFF)
        {
            // Skip over the trail surrogate.
            // In reality you should then check if the trail surrogate is in the correct range as well, because, you know, third party input can't be trusted at all.
            string ++;
        }

        length ++;
        string ++;
    }

    return length;
}



By now two more things should be clear:
a) printf() with the %s format specifier and a UTF16 string doesn't work
b) UTF16 can't possibly fit into an ASCII string
c) (bonus) Don't use the string directly as the format string, for heaven's sake; third party input can't be trusted!

Now, the way to make printf() work with a UTF16 string is by converting the UTF16 string into an ASCII string, which is lossy in most cases, because UTF16 can represent many more characters than ASCII possibly can. Writing such a conversion function is a nice exercise, and everyone here should be capable of doing it (now that they know what a UTF16 string looks like). If you can't be bothered, here you go (but please feel bad for about 10 minutes or so):
Click to reveal..

Code:
const char *str_UTF16ToASCII(unsigned short *string) // unsigned short: surrogate values don't fit into a signed short
{
    unsigned int length = str_lengthw(string);
    char *buffer = malloc(length + 1);
    char *temp = buffer;

    while(*string)
    {
        char character = '?';
        if(*string <= 0x7F)
            character = (char)*string;

        *temp = character;

        if(*string >= 0xD800 && *string <= 0xDBFF)
            string ++; // skip the trail surrogate

        string ++;
        temp ++;
    }

    *temp = '\0';
    STRING *tstring = str_printf(NULL, "%s", buffer);
    free(buffer);

    return _chr(tstring);
}



Last edited by JustSid; 02/07/13 15:52. Reason: Formatting and stuff

Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com
Re: UNICODE Text to String [Re: WretchedSid] #417059
02/07/13 23:02
Joined: Jan 2002
Posts: 4,225
Germany / Essen
Uhrwerk Offline
Expert
This post is awesome. It should be printed on A1 paper and be put directly into the ACKNEX temple right next to the JCL poster.

I would have voted for the wiki in the first place, but that's offline.


Always learn from history, to be sure you make the same mistakes again...
Re: UNICODE Text to String [Re: Uhrwerk] #417077
02/08/13 08:43
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Thanks for your help, it's working!
And thank you JustSid for your fantastic explanation laugh

Re: UNICODE Text to String [Re: Benni003] #417081
02/08/13 09:34
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Hm, now I've got another problem:
I want str_main and str_pointer to be compared, and if they are the same, the engine should exit.
I also have a Unicode text file named file.txt which just contains "NULL".
Can anyone help me with this?
It has to be Unicode, because I need special characters.

#include <acknex.h>
#include <default.c>

FONT* fontArial = "Arial Unicode MS#30";

PANEL* panel =
{
    pos_x = 0; pos_y = 0;
    flags = SHOW; layer = 3;
}

STRING* str_main;
STRING* str_pointer;

function main()
{
    str_main = str_create("");
    str_pointer = str_create("NULL");

    //---------------------------------------------
    var file = file_open_read("file.txt");

    file_str_readtow(file, str_main, NULL, 5000);

    if(str_cmpni(str_main, str_pointer) == 1) { sys_exit(""); }

    file_close(file);
    //---------------------------------------------

    pan_setstring(panel, 0, 0,  0, fontArial, str_main);
    pan_setstring(panel, 0, 0, 20, fontArial, str_pointer);
}

Re: UNICODE Text to String [Re: Benni003] #417093
02/08/13 11:52
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in .c or .h files. You have to read all strings from text files, even the ones used only for comparisons. Think about it: you are trying to compare ASCII with Unicode; that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.


no science involved
Re: UNICODE Text to String [Re: fogman] #417094
02/08/13 11:56
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
I bet you have to localize a game for a publisher? Contact me if you need help, I've done that four times already. If you haven't planned for Unicode from the start, it'll be a bunch of work. You can contact me at tf [at] zappadong.de


no science involved
Re: UNICODE Text to String [Re: fogman] #417097
02/08/13 12:12
Joined: Nov 2012
Posts: 62
Istanbul
Talemon Offline
Junior Member
You can define Unicode strings in code; I do it like this:
short null_char = '\0';
STRING* str = str_createw(&null_char);

Re: UNICODE Text to String [Re: fogman] #417100
02/08/13 13:05
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Originally Posted By: fogman
The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in .c or .h files. You have to read all strings from text files, even the ones used only for comparisons. Think about it: you are trying to compare ASCII with Unicode; that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.


Your solution is good, and I tried this before, but it's not working correctly.

//--------------------------
1. Working:

Textfile:
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "NULL" from textfile

if(str_cmpni(str1,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//--------------------------
2. NOT working (this is the case I need):

Textfile:
"Hello"
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "Hello" from textfile
file_str_readtow(file,str2,NULL,5000); // str2 includes "NULL" from textfile

if(str_cmpni(str2,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//------------------------------

It seems like it's not working if it's not the first string in the text file.
