Solving the Unicode, UTF8, UTF16 and Text Files conundrum in VBA

Understanding Unicode encodings like UTF8 and UTF16, and how they affect your Office VBA development, is not so straightforward. This post will guide you through the experience of reading a text file with VBA and explain some of the pitfalls you may encounter along the way when dealing with different text encodings and file formats. We’ll shed some light on essential Unicode concepts you’ve preferred to leave aside until now, because – let’s face it – who wants to spend hours reading Wikipedia or MSDN just to read a text file, or to understand the many rules and APIs for converting between encodings?

No bulky and verbose .NET or undecipherable C++ code complications here.
Just immediately actionable, simple and humble VBA code with one function to rule them all, and a 10 to 15 minute read to understand it all.

Let’s start the experience right now. Try something:

  • Open Windows notepad and copy/paste (or type) this text:

    Fancy a café ? Or a piña colada ? – Oh, that’s so cliché!

    I have a strong impression of Déjà vu.

    You hide your true motive behind a friendly façade.

    (Just my lame attempt at composing words with diacritics, in English. Inspiration found here and here)

  • Save the file; let’s say in c:\temp\textfiles\notepad_text.txt

Now we’ll try to read it and display it in Visual Basic, line by line, as usual:

Executing the “Test_ReadTextFileByLine” Sub (in the debug window) from this simple code snippet should do it…
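
A minimal sketch of that kind of line-by-line read, assuming the classic Open ... For Input / Line Input approach. The procedure names below are illustrative and are not the ones used in the repository:

```vba
Option Explicit

' Reads a text file line by line with VBA's native file statements
' and prints each line to the Immediate (debug) window.
Public Sub SketchReadTextFileByLine(ByVal sFilename As String)
    Dim iFile As Integer
    Dim sLine As String

    iFile = FreeFile
    Open sFilename For Input As #iFile
    Do While Not EOF(iFile)
        Line Input #iFile, sLine
        Debug.Print sLine
    Loop
    Close #iFile
End Sub

' Hypothetical test procedure; the path is the one used earlier in this post.
Public Sub SketchTestReadTextFileByLine()
    SketchReadTextFileByLine "c:\temp\textfiles\notepad_text.txt"
End Sub
```

There is no error handling here; a missing file simply raises a runtime error.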

…or not (!). The accented characters don’t display correctly.

Unicode, UTF8 and UTF16

Let’s state some facts before banging our heads on that:

  • Two forms of Unicode will be of interest here: UTF8 and UTF16.
  • “Windows is Unicode“, UTF16 Unicode. So is VBA. Unicode is a big character set meant to represent the characters of all the world’s writing systems.
  • UTF16 encodes most characters in common use with a single 16-bit unit, that is two bytes (a “wide” character, and by extension “wide” strings). Characters outside the Basic Multilingual Plane, such as most emoji, take two such units (a surrogate pair), so four bytes.
    (Note: UCS2 (or UCS-2) is the older, fixed two-byte encoding without surrogate pairs; for the text in this post, UCS2 and UTF16 produce the same bytes.)
  • The number that Unicode assigns to a character is called a code point (for example, U+00E9 for “é”).
  • UTF8: not all characters really need two bytes. UTF8 is a variable-length encoding of the same code points, using one byte for ASCII characters and two to four bytes for everything else.
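
To make the size differences concrete, here is a small sketch (the function names are mine, for illustration only) that returns how many bytes UTF8 and how many 16-bit units UTF16 need for a given code point. It does not reject the surrogate range U+D800 to U+DFFF, which is not valid as a standalone character:

```vba
Option Explicit

' Number of bytes UTF-8 uses for one Unicode code point (U+0000..U+10FFFF).
' Note the trailing "&" on hex literals: without it, &HFFFF is the Integer -1.
Public Function SketchUtf8ByteCount(ByVal lCodePoint As Long) As Long
    Select Case lCodePoint
        Case 0 To &H7F&
            SketchUtf8ByteCount = 1     ' ASCII
        Case &H80& To &H7FF&
            SketchUtf8ByteCount = 2     ' e.g. accented Latin letters
        Case &H800& To &HFFFF&
            SketchUtf8ByteCount = 3     ' rest of the Basic Multilingual Plane
        Case &H10000 To &H10FFFF
            SketchUtf8ByteCount = 4     ' supplementary planes
        Case Else
            Err.Raise 5                 ' invalid procedure call or argument
    End Select
End Function

' Number of 16-bit code units UTF-16 uses for one Unicode code point.
Public Function SketchUtf16UnitCount(ByVal lCodePoint As Long) As Long
    If lCodePoint < 0 Or lCodePoint > &H10FFFF Then Err.Raise 5
    If lCodePoint <= &HFFFF& Then
        SketchUtf16UnitCount = 1
    Else
        SketchUtf16UnitCount = 2        ' surrogate pair
    End If
End Function
```

For “é” (U+00E9), SketchUtf8ByteCount(&HE9) returns 2 and SketchUtf16UnitCount(&HE9) returns 1.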

Back to reading our file

At this point, we can guess that our old friend Notepad (on the en_US version of Windows 10 in my setup) probably stored our text file using UTF8 encoding, which VBA’s file statements are not aware of. Let’s take a look at the bytes in the file:

We see at lines 0 and 30 that our accented “é” are encoded as the two bytes C3 and A9, so this is a UTF8 file.
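
Where do C3 and A9 come from? For code points from U+0080 to U+07FF, UTF8 splits the value into a lead byte (110xxxxx) and a trailing byte (10xxxxxx). A minimal sketch of that arithmetic, with a function name of my own invention and valid only for that two-byte range:

```vba
Option Explicit

' Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes
' and returns them as a hex string.
Public Function SketchUtf8TwoByteHex(ByVal lCodePoint As Long) As String
    Dim bLead As Byte
    Dim bTrail As Byte

    If lCodePoint < &H80& Or lCodePoint > &H7FF& Then Err.Raise 5

    bLead = &HC0 Or (lCodePoint \ 64)       ' 110xxxxx: upper 5 bits
    bTrail = &H80 Or (lCodePoint And &H3F)  ' 10xxxxxx: lower 6 bits
    SketchUtf8TwoByteHex = Hex$(bLead) & " " & Hex$(bTrail)
End Function
```

Calling SketchUtf8TwoByteHex(&HE9) for “é” (U+00E9) returns "C3 A9", the two bytes seen in the file.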

So, at some point, we’ll have to convert the UTF8 representation of a string to a VBA friendly UTF16 one.

Unfortunately, VBA cannot help here, so let’s take a detour to our trustworthy Win32 API.

Converting from UTF8 to UTF16 with the Win32 API

You’ll find all the code in the demo database of my reading_text_files github repository.

The function we’ll need is MultiByteToWideChar(), which we can declare as:

We’re going to have two kinds of source data to convert to UTF16, byte arrays and strings, so these are the signatures of the VBA API functions we’ll use:
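
As a rough sketch of what such a declaration and a byte array wrapper can look like: the Declare follows the documented Win32 signature and assumes VBA7 (Office 2010 or later, for PtrSafe and LongPtr). The wrapper name is hypothetical and is not the function from the repository:

```vba
Option Explicit

Private Const CP_UTF8 As Long = 65001

Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As LongPtr, _
    ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As LongPtr, _
    ByVal cchWideChar As Long) As Long

' Converts an allocated array of UTF-8 bytes to a VBA (UTF-16) string.
' Returns an empty string if the conversion fails.
Public Function SketchUtf8BytesToString(ByRef abUtf8() As Byte) As String
    Dim lByteCount As Long
    Dim lCharCount As Long
    Dim sResult As String

    lByteCount = UBound(abUtf8) - LBound(abUtf8) + 1
    If lByteCount <= 0 Then Exit Function

    ' First call: no output buffer, returns the required number of UTF-16 units.
    lCharCount = MultiByteToWideChar(CP_UTF8, 0, _
        VarPtr(abUtf8(LBound(abUtf8))), lByteCount, 0, 0)
    If lCharCount = 0 Then Exit Function

    ' Second call: convert into a preallocated string buffer.
    sResult = String$(lCharCount, vbNullChar)
    lCharCount = MultiByteToWideChar(CP_UTF8, 0, _
        VarPtr(abUtf8(LBound(abUtf8))), lByteCount, StrPtr(sResult), lCharCount)

    SketchUtf8BytesToString = Left$(sResult, lCharCount)
End Function
```

The two-call pattern (ask for the size, then convert) avoids guessing the buffer length. The caller has to pass an allocated array, since UBound raises an error on an unallocated one.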

Unicode normalization

There’s more than one way to represent a combination of characters in Unicode (MSDN). Extract:

Capital A with dieresis (umlaut) can be represented either as a single Unicode code point “Ä” (U+00C4) or the combination of Capital A and the combining Dieresis character (“A” + “¨”, that is, U+0041 U+0308). Similar considerations apply for many other characters with diacritic marks.

Simply put, a problem arises if we compare two Unicode strings that are conceptually the same, but use different code points (as in the example above).

There are two more Win32 API functions that can help with that. One, NormalizeString(), transforms a Unicode string to a standard form, so it can be compared with another, even if the representations are different. The other, IsNormalizedString(), tests if a Unicode string is in the expected form.

There are a number of standard forms, but the main one of interest here is the one that composes a base character and its combining marks into a single code point wherever possible (we get the single “Ä” instead of “A” followed by the combining dieresis). This is “NormalizationC”, value 1, from the following (C++) enumeration:

Normalization is an optional step, but it should be used for security reasons.
I wrapped the API (and followed MSDN guidance) in these two VBA API functions, and two others to get any error information:
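
For illustration, here is a minimal sketch of such a wrapper. The Enum mirrors the Win32 NORM_FORM values and the Declare follows the documented NormalizeString signature (exported by Normaliz.dll). The names SketchNormForm and SketchNormalizeToNFC are hypothetical, VBA7 is assumed, and this is not the UcNormalizeString() code from the repository:

```vba
Option Explicit

' VBA counterpart of the Win32 NORM_FORM enumeration.
Public Enum SketchNormForm
    NormalizationOther = 0
    NormalizationC = 1
    NormalizationD = 2
    NormalizationKC = 5
    NormalizationKD = 6
End Enum

Private Declare PtrSafe Function NormalizeString Lib "Normaliz.dll" ( _
    ByVal NormForm As Long, _
    ByVal lpSrcString As LongPtr, _
    ByVal cwSrcLength As Long, _
    ByVal lpDstString As LongPtr, _
    ByVal cwDstLength As Long) As Long

' Normalizes sSource to form C. Returns True on success, with the
' normalized text in sResult; returns False on any API error.
Public Function SketchNormalizeToNFC(ByVal sSource As String, _
                                     ByRef sResult As String) As Boolean
    Const ERROR_INSUFFICIENT_BUFFER As Long = 122
    Dim lSize As Long
    Dim lRet As Long
    Dim lDllError As Long
    Dim iTry As Long

    sResult = vbNullString
    If Len(sSource) = 0 Then
        SketchNormalizeToNFC = True
        Exit Function
    End If

    ' A null destination with length 0 asks for an estimated buffer size.
    lSize = NormalizeString(NormalizationC, StrPtr(sSource), Len(sSource), 0, 0)

    For iTry = 1 To 10
        If lSize <= 0 Then Exit For
        sResult = String$(lSize, vbNullChar)
        lRet = NormalizeString(NormalizationC, StrPtr(sSource), Len(sSource), _
                               StrPtr(sResult), lSize)
        lDllError = Err.LastDllError
        If lRet > 0 Then
            sResult = Left$(sResult, lRet)   ' keep only what was written
            SketchNormalizeToNFC = True
            Exit Function
        End If
        ' On a too-small buffer, the negated return value is a new size estimate.
        If lDllError <> ERROR_INSUFFICIENT_BUFFER Then Exit For
        lSize = -lRet
    Next iTry

    sResult = vbNullString
End Function
```

With the MSDN example, passing "A" & ChrW$(&H308) is expected to produce the single character ChrW$(&HC4), so the two spellings of “Ä” compare equal after normalization.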

Note:

I’m not following my coding guidelines for keeping error information inside a module, because we get an error either when calling Win32 API functions or a “logical” error when using the VBA API.

So, to find out whether something went wrong when calling UcNormalizeString(), we have to test like this:

You can see a test scenario, that I sort of translated from the ones in MSDN, in the Test_Normalization() Sub, which calls:

Back to reading our file – again

Ok, now we know for sure that our file is in UTF8.
And we know that we have a nice UTF8DecodeString() at our disposal.

Are we not tempted to make this slight adaptation to our ReadTextFileByLine() function ? (see the UTF8DecodeString call in this code):

The result:

Whaaaat ? – Let’s debug that using the provided DumpStringBytes() function:
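
A byte dump of a VBA string can be sketched as follows. This is a stand-in with a name of my own, not the DumpStringBytes() function from the repository. It relies on the fact that assigning a String to a Byte array copies the string’s in-memory UTF16 bytes:

```vba
Option Explicit

' Prints the in-memory bytes of a VBA string as hex pairs,
' two bytes (low byte first) per UTF-16 code unit.
Public Sub SketchDumpStringBytes(ByVal sText As String)
    Dim abBytes() As Byte
    Dim i As Long
    Dim sOut As String

    If Len(sText) = 0 Then Exit Sub

    abBytes = sText   ' raw copy of the UTF-16 LE bytes, no conversion
    For i = LBound(abBytes) To UBound(abBytes)
        sOut = sOut & Right$("0" & Hex$(abBytes(i)), 2) & " "
    Next i
    Debug.Print sOut
End Sub
```

For plain ASCII text, every second byte is 00, which is a quick visual hint that the string holds UTF16 and not UTF8.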

Which brings us to that output:

As we can see (comparing with the previous hex dump of the file), the variable sLine that VBA read from the file does not hold a UTF8 string.

VBA converts each line it reads from the file to a double byte (UTF16) string, treating every byte as a character of the system ANSI code page.

We cannot use VBA string variables to read a UTF8 encoded text file.

Solution for reading and converting an UTF8 text file

We have to open the file in binary mode and read the contents into a byte array. This way VBA doesn’t do any conversion. We then just convert the UTF8 byte array to a VBA (UTF16) string with UTF8DecodeByteArrayToString().
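
The binary read itself can be sketched like this (the function name is illustrative). The returned array can then be handed to a converter such as UTF8DecodeByteArrayToString(), whose exact signature is in the repository and is not assumed here:

```vba
Option Explicit

' Reads a whole file into a byte array, without any text conversion.
' Returns an unallocated array for an empty file.
Public Function SketchReadFileBytes(ByVal sFilename As String) As Byte()
    Dim iFile As Integer
    Dim lSize As Long
    Dim abData() As Byte

    iFile = FreeFile
    Open sFilename For Binary Access Read As #iFile
    lSize = LOF(iFile)
    If lSize > 0 Then
        ReDim abData(0 To lSize - 1)
        Get #iFile, , abData      ' fills the array with the raw file bytes
    End If
    Close #iFile

    SketchReadFileBytes = abData
End Function
```

This sketch has no error handling and reads the file in one go, which is fine for small text files but not for very large ones.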

And finally, we get it right:

Other text file encodings and BOMs

Notepad saved our file in UTF8, but there are other text encodings and file formats.

UTF8 and UTF16 text files may or may not start with a special series of bytes called the BOM (Byte Order Mark). The BOM is a magic number that we can use to infer the encoding and the endianness (byte order) of the file contents.

Without a BOM, guessing the file encoding can be tough.
But when there’s one, we can use it to make the necessary conversions, as in the GetFileText() function, which can handle the following file encodings:

  • UTF16 BE / LE (Big Endian / Little Endian) with or without BOM,
  • UTF8 with or without BOM
  • ANSI (8 bits characters text, different character sets or code pages possible)
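
BOM detection boils down to comparing the first two or three bytes of the file with well-known values: EF BB BF for UTF8, FF FE for UTF16 little endian and FE FF for UTF16 big endian. A minimal sketch, with a hypothetical function name and plain strings as return values, which is not how GetFileText() necessarily does it:

```vba
Option Explicit

' Inspects the first bytes of an allocated byte array and returns
' "UTF-8", "UTF-16 LE", "UTF-16 BE", or "none" when no BOM is found.
Public Function SketchDetectBom(ByRef abData() As Byte) As String
    Dim i As Long
    Dim n As Long

    i = LBound(abData)
    n = UBound(abData) - i + 1
    SketchDetectBom = "none"

    If n >= 3 Then
        If abData(i) = &HEF And abData(i + 1) = &HBB And abData(i + 2) = &HBF Then
            SketchDetectBom = "UTF-8"
            Exit Function
        End If
    End If

    If n >= 2 Then
        If abData(i) = &HFF And abData(i + 1) = &HFE Then
            SketchDetectBom = "UTF-16 LE"
        ElseIf abData(i) = &HFE And abData(i + 1) = &HFF Then
            SketchDetectBom = "UTF-16 BE"
        End If
    End If
End Function
```

When the result is "none", the file may be ANSI, UTF8 without BOM or UTF16 without BOM, and telling these apart needs heuristics on the content. The sketch also ignores UTF32, whose little endian BOM starts with the same FF FE bytes.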

This is the signature of the function (code in the MTextFiles module of the Reading_Text_Files.accdb project), with a bit of documentation:

There’s a “text_files_samples” directory, in the github repository, with one file for each possible text file encoding.
Note that there are no UTF16 files with no BOM, as I used Notepad++ to generate the files and there’s no option in Notepad++ to generate UTF16 files with no BOM.

The Test_ReadSampleEncodings() procedure will read and check the contents of each file with this GetFileText() function:

Conclusion

We’ve seen different representations of text and encodings like UTF8 and UTF16. We’re now able to convert between those encodings. And we now know how to read text with VBA from files in some of the most common formats we may encounter.

From here, it should be quite easy to also write any of these formats (using files open in binary mode helps).

Downloads

Head to the Reading_Text_Files github repository to get the source code, the example files and the Access demo database.

(MIT Licence)

How to get useful Windows device information in VBA

Here are (at least) 3 properties of hardware like hard disks (whether they are HDD, SSD, USB or something else), printers or anything else that Windows considers a device. These can be very useful to obtain in applications that manage enterprise assets like data and documents:

  • Vendor name,
  • Vendor ID,
  • and serial number.

These hardware properties allow us to uniquely and reliably identify a physical device.

Unfortunately, they’re not so easy to obtain. Particularly if we develop our applications with the “high level” languages that we use in database management systems like Microsoft Access (but also WinDev or FileMaker to name a few others), or in development environments that do not have the necessary, production ready, components or libraries to interact with the system.

Then there are different technologies we can use to get to these hardware properties, like:

  • Windows Management Instrumentation (or WMI), which has already morphed into MI; it is feature rich and COM/ActiveX friendly, but – boy – what a learning curve and what a maintenance nightmare that would be, just to do such a simple thing;
  • .NET, although it would be like using a cannon to kill a mosquito to fire up the .NET runtime engine just to use a nugget of its capacities to bring this small functionality to applications that are not .NET based.
  • The venerable and almighty Win32 API, that we’ll use here from C, to bridge its functionality to Visual Basic, because the Win32 API is mostly C/C++ friendly; my winner here, attaboy.

So, let’s start at the end (!), because we want to get into action immediately, don’t we ?
Some explanation will follow later in this post.

Just before jumping there, let me point out where I am going with all this stuff next:

  • We’re going to combine this code in an out of process, Visual Basic 5 ActiveX EXE server. That will allow us to detect, in real time, when a USB (or other type of) device is plugged into or unplugged from the system;

  • That’s for a bit later, we have some other things to talk about first, but I can’t resist the urge to spoil it:
    using a similar technique (the ActiveX OOP EXE server), we’ll turn Access into a (local machine/network) Web server (whaaat ?!? – Yes, no apache, nginx, node, or whatever needed!)

Watch the (short) demo video

Get device information from an Access database demo

Download

  • binaries

    • deviceinfo.dll (32bits version, MD5 sum: fb86ea314cdc4b2fd42f01b3876230d3)
    • deviceinfo64.dll (64bits version, MD5 sum: c27ce08c1c7d982bc475181d6f1cc613)
  • C DLL source code on my github

  • Access Demo database

    • deviceinfo_dll.accdb (contains the MDeviceInfo.bas module with the necessary declares, and additional wrapping to easily use the DLL with one function call and a supporting Type definition).
      Important note: copy the two DLLs (deviceinfo.dll and deviceinfo64.dll) into the same directory as the demo database deviceinfo_dll.accdb. The DLLs have to be in one of your PATH directories, or in the same directory as the application using them.

Some (quick) explanation

We can see the VB/A declarations that import the DLL functions in the declarations section of the MDeviceInfo.bas module:

Get to one API function

Instead of cluttering our code with calls to each of the DLL functions, we can create one single function that does it all:

This way, we don’t have to worry about forgetting to call the DICreateDeviceInfo() and DIDestroyDeviceInfo() functions that are necessary for the DLL to do its job without crashes.

We gather the retrieved device information in an easy to access structure, in the following (public) Type definition:

Another way of doing it is using a class module. We’ll do that in the next post, where we set up the ActiveX OOP EXE server that will notify us of the devices arrival/removal in real time.

Sample usage

Take a look at the demo() procedure in the MMain.bas module:

About commercial use

If you’ve been following me through this coding journey in my latest posts, you’ve probably noticed that I’m releasing all of the code and other assets under either the MIT or the GPL v2 license. While works under the MIT license are completely free to use in all scenarios (commercial or not), that’s not true for works under the GPL v2 license, where you have to give these assets, or your derived ones, back to the community or to whomever you distribute them to.

Talking about the journey, if you’re wondering why I’m starting to make code and posts rain these days (well, I mean, considering my personal publishing scale), allow me to give you a hint: I started my sole proprietorship as an independent software developer back in May 1995…

So, I’m going to set up a dual licensing system, so that you can reuse the code/assets I publish here (or through devinfo.net) in your commercial applications without worries. I know one buck doesn’t have the same value depending on our situation or where we live. Fear not: I can’t make specific promises for now, but I’ll do my best to be fair and affordable for everyone, leaving open the option to PayPal me a coffee or a beer, or just leave a donation (on which I’ll give back somehow), when it is reasonable.

Anyway, if you made it this far, thanks for your attention. Don’t hesitate to share a link to this post where you see fit, open or join the conversation here, or drop me a tweet (I’m @francescofoti); I’ll always appreciate that.

And see you soon, because we’re not done yet 😉