UNICODE Text to String #417022
02/07/13 12:23
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Hello, I hope someone can help me.
I want to read text from a Unicode text file into a string (this is important, because special characters are needed).
It's not working, and I don't know much about this topic.
Can anyone help me with this? Thank you.

Here is a demo showing the problem:
Example (attachment; on the linked page, click the smaller download button)

And here is the code separately:

STRING* str1 = "";
STRING* str2 = "";

function main()
{
    var file = file_open_read("file.txt");

    file_str_readtow(file, str1, " ", 200);
    file_str_readtow(file, str2, " ", 200);

    file_close(file);

    printf(_chr(str1));
    printf(_chr(str2));
}

Last edited by Benni003; 02/07/13 12:28.
Re: UNICODE Text to String [Re: Benni003] #417025
02/07/13 13:24
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
printf doesn't seem to use a Unicode font. The following works, so try this:

#include <acknex.h>
#include <default.c>

STRING* str1 = "#8";
STRING* str2 = "#8";

FONT* fontArial = "Arial#20b"; // TrueType font, 20 point bold characters

TEXT* txtTest =
{
    layer = 999;
    string = str1;
    flags = SHOW;
    font = fontArial;
}

function main()
{
    var file = file_open_read("file.txt");

    file_str_readtow(file, str1, " ", 200);
    file_str_readtow(file, str2, " ", 200);

    file_close(file);

    // printf(_chr(str1));
    // printf(_chr(str2));
}


no science involved
Re: UNICODE Text to String [Re: Benni003] #417031
02/07/13 13:53
Joined: Apr 2007
Posts: 3,751
Canada
WretchedSid Offline
Expert
Yes, finally, character encoding, I was already afraid that this topic would never come up here.

So, fasten your seatbelts and let's talk about character encoding. A string is composed of a number of bits and bytes, just like your average integer; however, the interpretation of a string is a much bigger clusterfuck than the interpretation of an integer (mainly because we agreed on what an integer looks like, we just can't always agree on the byte order). It's a bit like the thousand and one floating point formats out there, except that it's even more fucked up, because usually floating point means IEEE 754 and it just works™.

When you type a string in Lite-C, it's encoded in the so-called ASCII format. Each character is exactly one byte in size and has a value from 0-127, which is enough to encode all English characters, a few punctuation marks, some control characters, etc. (here is the complete list: http://www.asciitable.com/). Now, this obviously poses a problem, because as it happens there are many, many more characters in the world, and people also want to use emojis in their text messages, so having just 128 characters is a bit meh. Luckily we came up with a billion ways to encode all kinds of characters, the most commonly used one being UTF8 (except in some IRC channels on freenode, which will ban you if you use it). UTF8 at its core uses one byte per character and has the same first 128 characters as ASCII, so an ASCII string is also a valid UTF8 string (hooray, someone thought about compatibility). However, a character can also be larger than one byte in UTF8 (up to 4 bytes) and thus represent a character outside of the 128 ASCII characters (except that the string then stops being a valid ASCII string and becomes garbage to an ASCII reader). Oh yeah, by the way, UTF8 is a Unicode encoding. Just like UTF16 and UTF32.

So, what the fuck is Unicode? Unicode, or ISO 10646 if you like numbers, is a standard and not an encoding (a standard created because we had too many competing standards; relevant xkcd: http://xkcd.com/927/). Unicode is basically a list of characters, currently 1,114,112 code points, containing everything from Latin characters, through most of the Asian characters, currency symbols, mathematical symbols and scientific symbols, to the good old smiling pile of poo emoji (code point U+1F4A9, if anyone wants to know).
Each character in the Unicode table has a so-called code point, which basically is just a number, written in hex and prefixed with U+. The ß, for example, is U+00DF, or simply 223 in decimal. The characters in the Unicode list are broken into so-called planes; the first plane covers everything that fits into 2 bytes, is properly called the Basic Multilingual Plane, and contains the most common characters (i.e. the pile of poo is not there). Planes are broken into blocks, and the first block of the Basic Multilingual Plane is the Basic Latin block, and guess which 128 characters those are (that's what happens if you let Americans design shit... after they designed a bazillion other standards on how to write Latin strings).

Let's go back to UTF8 and its friends UTF16 and UTF32. All three encodings are Unicode encodings, meaning that they can be used to represent Unicode strings. The difference in how they encode the Unicode characters is their unit size: UTF8 uses 1-byte units, UTF16 uses 2-byte units and UTF32 uses 4-byte units. UTF32 is the only one of the three that can hold every Unicode character in a single unit, but it also wastes a lot of space, because not every character needs 4 bytes (as already mentioned, the most commonly used English characters fit perfectly into 1 byte). I'm going to skip the UTF8 encoding details for now and get straight to UTF16, which is the encoding Gamestudio uses.

In UTF16 each character has a unit size of 2 bytes, but it can extend up to 4 bytes depending on what you want to encode (the smiling pile of poo, for example, uses 4 bytes). This is the point where it should start to get clear that a) character encoding is a fucked up art coming from a time when people couldn't make/afford RAM or hard drives larger than a few kilobytes, and b) UTF16 doesn't work with normal C strings, because their units are expected to be 1 byte in size while UTF16 units are 2 bytes (so no, the problem is not that printf() uses a non-Unicode font, but that it works with C strings, which have a completely different unit size).

So, how does iterating over a string encoded in UTF16 work? The straightforward approach would be this:
Code:
short *myString = xyz;
while(*myString)
{
     myString ++;
}



Except not. As mentioned before, not every character is 2 bytes in size, some can be larger, so unlike with C strings, making assumptions about the length of a string without looking at its content is bad, and you should feel bad if you do it. The good thing, as already mentioned, is that Unicode wasn't designed by complete lunatics, so the most commonly used characters, including Asian ones, are neatly put into the first 2-byte range of the Unicode table, and if you only use these characters you can assume that each character is 2 bytes and be happy. Except not, because third party input can't be trusted, so let's talk about digesting a UTF16 string.

The easy case is that your character is 2 bytes; this includes everything up to code point U+FFFF.
The hard case is everything else. When code points starting at U+10000 are encoded, UTF16 uses 4 bytes per character, broken into the so-called lead and trail surrogates. This is done by first subtracting 0x10000 from the code point (leaving 20 bits) and then putting the higher 10 bits into the lead surrogate and the lower 10 bits into the trail surrogate, so the lead surrogate is a number between 0xD800 and 0xDBFF and the trail surrogate is a number between 0xDC00 and 0xDFFF.
Easy? Yep, so here is how to write a correct str_length() for UTF16 encoded strings:
Code:
unsigned int str_lengthw(unsigned short *string) // unsigned short: surrogate values don't fit into a signed short
{
    unsigned int length = 0;
    while(*string)
    {
        // Check if this is a 4 byte character
        if(*string >= 0xD800 && *string <= 0xDBFF)
        {
            // Skip over the trail surrogate.
            // In reality you should then check if the trail surrogate is in the correct range as well, because, you know, third party input can't be trusted at all.
            string ++;
        }

        length ++;
        string ++;
    }

    return length;
}



By now two more things should be clear:
a) printf() with the %s format specifier and a UTF16 string doesn't work
b) UTF16 can't possibly fit into an ASCII string
c) (bonus) Don't use the string directly as the format string, for heaven's sake; third party input can't be trusted!

Now, the way to make printf() work with a UTF16 string is by converting the UTF16 string into an ASCII string, which is lossy in most cases, because UTF16 can represent many more characters than ASCII possibly can. Writing such a conversion function is a nice exercise, and everyone here should be capable of doing it (now that they know what a UTF16 string looks like). If you can't be bothered, here you go (but please feel bad for about 10 minutes or so):
Click to reveal..

Code:
const char *str_UTF16ToASCII(unsigned short *string) // unsigned short: surrogate values don't fit into a signed short
{
    unsigned int length = str_lengthw(string);
    char *buffer = malloc(length + 1);
    char *temp = buffer;

    while(*string)
    {
        char character = '?';
        if(*string <= 0x7F)
            character = (char)*string;

        *temp = character;

        if(*string >= 0xD800 && *string <= 0xDBFF)
            string ++; // skip the trail surrogate

        string ++;
        temp ++;
    }

    *temp = '\0';
    STRING *tstring = str_printf(NULL, "%s", buffer);
    free(buffer);

    return _chr(tstring);
}



Last edited by JustSid; 02/07/13 15:52. Reason: Formatting and stuff

Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com
Re: UNICODE Text to String [Re: WretchedSid] #417059
02/07/13 23:02
Joined: Jan 2002
Posts: 4,225
Germany / Essen
Uhrwerk Offline
Expert
This post is awesome. It should be printed on A1 paper and be put directly into the ACKNEX temple right next to the JCL poster.

I would have voted for the wiki in the first place, but that's offline.


Always learn from history, to be sure you make the same mistakes again...
Re: UNICODE Text to String [Re: Uhrwerk] #417077
02/08/13 08:43
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Thanks for your help, it's working!
And thank you JustSid for your fantastic explanation laugh

Re: UNICODE Text to String [Re: Benni003] #417081
02/08/13 09:34
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Hm, now I've got another problem:
I want str_main and str_pointer to be compared, and if they are the same, the engine should exit.
I also have a Unicode text file named file.txt which just contains "NULL".
Can anyone help me with this?
It has to be Unicode, because I need special characters.

#include <acknex.h>
#include <default.c>

FONT* fontArial = "Arial Unicode MS#30";

PANEL* panel =
{
    pos_x = 0; pos_y = 0;
    flags = SHOW; layer = 3;
}

STRING* str_main;
STRING* str_pointer;

function main()
{
    str_main = str_create("");
    str_pointer = str_create("NULL");

    //---------------------------------------------
    var file = file_open_read("file.txt");

    file_str_readtow(file, str_main, NULL, 5000);

    if(str_cmpni(str_main, str_pointer) == 1) { sys_exit(""); }

    file_close(file);
    //---------------------------------------------

    pan_setstring(panel, 0, 0,  0, fontArial, str_main);
    pan_setstring(panel, 0, 0, 20, fontArial, str_pointer);
}

Re: UNICODE Text to String [Re: Benni003] #417093
02/08/13 11:52
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in .c or .h files. You have to read all strings from text files, even the ones used only for comparisons. Think about it: you are trying to compare ASCII with Unicode; that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.


no science involved
Re: UNICODE Text to String [Re: fogman] #417094
02/08/13 11:56
Joined: Apr 2005
Posts: 4,506
Germany
fogman Offline
Expert
I bet you have to localize a game for a publisher? Contact me if you need help, I've done that four times already. If you haven't planned for Unicode from the start, it'll be a bunch of work. You can contact me at tf [at] zappadong.de


no science involved
Re: UNICODE Text to String [Re: fogman] #417097
02/08/13 12:12
Joined: Nov 2012
Posts: 62
Istanbul
Talemon Offline
Junior Member
You can define Unicode strings in code; I do it like this:
short null_char = '\0';
STRING* str = str_createw(&null_char);

Re: UNICODE Text to String [Re: fogman] #417100
02/08/13 13:05
Joined: Aug 2008
Posts: 394
Germany
Benni003 Offline OP
Senior Member
Originally Posted By: fogman
The problem is here:
str_pointer = str_create("NULL");

Basically, if you work with Unicode, you can't simply define a string directly in .c or .h files. You have to read all strings from text files, even the ones used only for comparisons. Think about it: you are trying to compare ASCII with Unicode; that won't work.

Solution: read "NULL" from a Unicode text file into str_pointer.


Your solution is good, and I tried this before, but it's not working correctly.

//--------------------------
1. Working:

Textfile:
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "NULL" from textfile

if(str_cmpni(str1,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//--------------------------
2. NOT working (this is the case I need):

Textfile:
"Hello"
"NULL"

file = file_open_read("file.txt");

file_str_readtow(file,str1,NULL,5000); // str1 includes "Hello" from textfile
file_str_readtow(file,str2,NULL,5000); // str2 includes "NULL" from textfile

if(str_cmpni(str2,str_pointer)==1){sys_exit("");} // str_pointer includes "NULL" from other file

file_close(file);

//------------------------------

It seems like it's not working if it's not the first string in the text file.
