Month: January 2020

Solving the Unicode, UTF8, UTF16 and Text Files conundrum in VBA

Understanding Unicode variants like UTF8 and UTF16 and how they impact your Office VBA development is not so straightforward. This post will guide thru the experience of reading a text file with VBA, explain some of the pitfalls you may encounter on this path when dealing with different text encodings and file formats. We’ll shed some light on essential Unicode concepts you’ve preferred to leave aside until now, because – let’s face it – who wants to spend hours reading wikipedia or MSDN just to read a text file or understand the many rules and APIs for converting between encodings ?

No bulky and verbose .NET or undecipherable C++ code complications here.
Just immediately actionable, simple and humble, VBA code with one function to rule them all, and a 10 to 15 minutes read to understand it all.

Let’s start the experience right now. Try something:

  • Open Windows notepad and copy/paste (or type) this text:

    Fancy a café ? Or a piña colada ? – Oh, that’s so cliché!

    I have a strong impression of Déjà vu.

    You hide your true motive behind a friendly façade.

    (Just my lame try to compose words with diacritics, in english. Inspiration found here and here)

  • Save the file; let’s say in c:\temp\textfiles\notepad_text.txt

Now we’ll try to read it and display it in Visual Basic, line by line, as usual:

Executing the “Test_ReadTextFileByLine” Sub (in the debug window) from this simple code snippet should do it…

…or not (!). The accented characters don’t display correctly.

Unicode, UTF8 and UTF16

Let’s state some facts before banging our heads on that:

  • Two forms of Unicode will be of interest here: UTF8 and UTF16.
  • “Windows is Unicode“, UTF16 Unicode. So is VBA. Unicode is a big character set which is meant to be able to represent the character glyphs of different languages.
  • Unicode (UTF16) encodes a character with two bytes (a “wide” character, in extension “wide” strings).
    (Note: UCS2 is history, assume UCS2 (or UCS-2) is UTF16)
  • The representation of a character in Unicode is also called a code point.
  • UTF8: not all the characters in the Unicode character set really need two bytes of encoding. UTF8 is sort of a packed representation of a series of Unicode characters, where one or two bytes can be used to represent a wide character.

Back to reading our file

At this point, we can guess that our Notepad old friend (on Windows 10 en_US version in my setup), probably stored our text file using a UTF8 encoding, which VBA is not aware of. Let’s take a look at the bytes in the file:

We see at lines 0 and 30 that our accented “é” are encoded as the two bytes C3 and A9, so this is a UTF8 file.

Then, at some point, we’ll have to convert an UTF8 representation of string to a UTF16 VBA friendly one.

Unfortunately, VBA cannot help here, so let’s take a detour to our trustworthy Win32 API.

Converting from UTF8 to UTF16 with the Win32 API

You’ll find all the code in the demo database of my reading_text_files github repository.

The function we’ll need is MultiByteToWideChar(), which we can declare as:

We’re going to have two variable sources, byte arrays and strings, to convert to UTF16, this is the VBA API functions signatures we’ll use:

Unicode normalization

There’s more than one way to represent a combination of characters in Unicode (MSDN). Extract:

Capital A with dieresis (umlaut) can be represented either as a single Unicode code point “Ä” (U+00C4) or the combination of Capital A and the combining Dieresis character (“A” + “¨”, that is, U+0041 U+0308). Similar considerations apply for many other characters with diacritic marks.

Simply put, a problem rises if we compare two Unicode strings that conceptually are the same, but use different code points (as the example above).

There are two more Win32 API functions that can help with that. One, NormalizeString(), transforms a Unicode string to a standard form, so it can be compared with another, even if the representations are different. The other, IsNormalizedString(), tests if a Unicode string is in the expected form.

There are a number of standard forms, but mainly, the one that “compresses” the code points into one character (I mean we get the attached form of “ae” instead of the “a” and “e”) is “NormalizationC”, value 1, from the following (C++) enumeration:

Normalization is an optional step, but for security considerations, should be used.
I wrapped the API (and followed MSDN guidance) in these two VBA API functions, and two others to get any error information:

Note:

I’m not following my coding guidelines for keeping error information inside a module, because we get an error either when calling Win32 API functions or a “logical” error when using the VBA API.

Then to test if something went wrong when calling UcNormalizeString() we have to test like that:

You can see a test scenario, that I sort of translated from the ones in MSDN, in the Test_Normalization() Sub, which calls:

Back to reading our file – again

Ok, now we know for sure that our file is in UTF8.
And we know that we have a nice UTF8DecodeString() at our disposal.

Are we not tempted to make this slight adaptation to our ReadTextFileByLine() function ? (see the UTF8DecodeString call in this code):

The result:

Whaaaat ? – Let’s debug that using the provided DumpStringBytes() function:

Which brings us to that output:

As we can see (and compare with the previous file’s hex dump), we do not have an UTF8 string in the variable sLine that is read by VBA from the file.

VBA converts the line it read from the file to a double byte (UTF16) string.

We cannot use VBA to read an UTF8 encoded text file using string variables.

Solution for reading and converting an UTF8 text file

We have to open the file in binary mode and read the contents in a byte array. This way VBA doesn’t do any conversion. We then just convert the byte array to an UTF8 string with the UTF8DecodeByteArrayToString().

And finally, we get it right:

Other text file encodings and BOMs

If Notepad saves files in UTF8 encoding, there are other encodings of text and file formats.

UTF8 and UTF16 text files may have, or not, a special series of bytes at the start of the file called the BOM (Byte Order Mark). The BOM is a magic number that we can use to infer the file encoding and byte endianness (order of bytes) of the file contents.

Without the BOM, guessing the file encoding can be tough.
But when there’s one, we can use it to make the necessary conversions, like in the following GetFileText() function, that can handle the following file encodings:

  • UTF16 BE / LE (Big Endian / Little Endian) with or without BOM,
  • UTF8 with or without BOM
  • ANSI (8 bits characters text, different character sets or code pages possible)

This is the signature of the function (code in the MTextFiles module of the Reading_Text_Files.accdb project), with a bit of documentation:

There’s a “text_files_samples” directory, in the github repository, with one file for each possible text file encoding.
Note that there are no UTF16 files with no BOM, as I used Notepad++ to generate the files and there’s no option in Notepad++ to generate UTF16 files with no BOM.

The Test_ReadSampleEncodings() procedure will read and check the contents of each file with this GetFileText() function:

Conclusion

We’ve seen different representation of text and encodings like UTF8, UTF16. We’re now able to convert between those encodings. And we now know how to read text from files with some of the most common file formats we may encounter, with VBA.

From here, it should be quite easy to also write any of these formats (using files open in binary mode helps).

Downloads

Head to the Reading_Text_Files github repository to get the source code, the example files and the Access demo database.

(MIT Licence)

Localization demo application screenshot

Using national language support in Office with VBA

If you need to handle multiple languages, get some specific regional settings, or do things like implement dynamic runtime language switching in your Microsoft Office solutions, you can leverage some additional help using the national language support Win32 API in VBA.

Do you know why you have to use the US English date format when you assign a date literal to a variable in VBA or in a SQL statement, surrounding it with “#” characters (like in: MyDate = #1/6/2020# for the 6 jan. 2020) ?
That’s because VBA (and the Office Object Model) is internally language locked on the locale which language ID is #1033.

What’s a locale and a language ID ? That’s how Windows manages multi language support.

There’s a lot of information about that on MSDN and on the web. Too much in fact to get started quickly.

This post will try to help you spare some time, make that journey quick and easy and jump into action now.

Here you’ll find :

  • A short review of the key concepts regarding national language support in Windows. Just what you need to know. Like what that means, and what language IDs and locales are.

  • A VBA standard module wrapping some of the most useful NLS Windows API immediately actionable.

  • A Microsoft Access database demo application using the module.
    See the short demo video:

Localization with VBA and the Win32 API demo

National language support with the Windows API

From MSDN:

The National Language Support (NLS) functions help applications support the different language- and locale-specific needs of users around the world.

The National Language Support (NLS) functions permit applications to:

  • Set the locale for the user
  • Identify the language in which the user works
  • Retrieve strings representing times, dates, and other information formatted correctly for the specified language and locale

Locales and Languages

Here’s the short version, while the full version of this topic is on MSDN.

  • In Windows, each language is identified by a 32bits Long value, the LANGUAGE ID, which has two parts:

    • The lower 16bits are the PRIMARY LANGUAGE ID
    • The higher 16bits are the SUB LANGUAGE ID

    Each language id can be translated to a string. For example, 1033 is “en_US” and 4108 is “fr_CH”. “en” is for the PRIMARY LANGUAGE ID, and “US” for the region, is the SUB LANGUAGE ID.

  • Each LANGUAGE ID has an associated LOCALE

    A “locale” is a collection of language-related user preference information represented as a list of values

  • Each value in the locale list, is indexed by (ie accessible via) a 32bits Long value, the LOCALE ID.

  • There can be multiple locales associated with one (primary) language (“en_US”, “en_GB”, etc…).

  • By using a valid LANGUAGE ID, applications can get localized strings and regional settings, like the full or abbreviated names of days and months, the region currency name and symbol, the various date and time formats, etc…

  • There is a preinstalled set of languages (and their locale settings) that ships with Windows.

    • LANG_SYSTEM_DEFAULT is a predefined constant for the language ID used to install WIndows. SUBLANG_SYS_DEFAULT represents its sub language ID.
    • LOCALE_USER_DEFAULT is a predefined constant for the language ID set for the current user.
    • There’s a predefined set of constants (LANG_*) for PRIMARY LANGUAGE IDs.
    • There’s a predefined set of constants (SUBLANG_*) for SUB LANGUAGE IDs.

Microsoft Office Application object language settings

Each VBA project in an Office application has access to a global Application object that has a:

  • LanguageSettings member that will give us a set of language IDs (MsoAppLanguageID enumeration), among which we can distinguish:

    • The language ID used to install Office (msoLanguageIDInstall),
    • The language ID in use for the current application (msoLanguageIDUI)
  • LanguagePreferredForEditing() function, that returns True, if a given language ID is used for editing (however that applies to the application).

Example (demo database)

Localization_win32.accdb, in this github repository, is an Access Database that illustrates how we can retrieve these language and regional settings from VBA.

It contains the MIntl.bas standard Visual Basic module, that wraps the necessary Win32 API functions and declarations. It also provides shortcut functions for some frequently used settings.

More on globalization

Globalization Documentation portal on MSDN.

Downloads

The files of this project are in the github repository, including the demo database (MIT license).

Explaining the real time device detector Access demo database

In this post, we’ll examine and explain how the Access application that demonstrates a possible, real life use case, of the device detection feature of the AxDeviceDetector ActiveX server.

See the downloads section below to download a copy of the database.

Scenario

We want to allow, let’s say, restaurant waiters or employees, to register their shifts by typing in their password, clicking a (start or end) button, and then plugging a USB key into the computer, to confirm they are who they say to be. This is an instance of a multi factor authentication, or MFA, system.

The DeviceDetectorAuthDemo.accdb demo database is essentially a finite state machine implementation.

Using USB device for multifactor authentication demo

States and transitions

The frmMain form can be in one of these 4 states:

  1. Select employee
  2. Enter password
  3. Wait for action
  4. Wait for device detection

It’s a cycle, when the wait for the device detection ends, we return on the first state (Select employee).

There is a constant definition for each of these states:

When placed in one of these states, the form needs to update its UI, so only the next possible state(s) can be triggered. This is done in the EnableStepControls() method. This method also updates the UI to show in which state number it is, by moving a green square (it’s a label control) under the correct state number label.

State transition triggers

There are two classes of events that provoke the transition between these states:

  • UI elements actions

    • Double-click on the employee listbox
    • Button clicks
  • Timer event

Each event code that provokes a state transition has the responsibility to do so only if the UI holds valid values for every possible event parameters. In this simple demo, cmdOK_Click() is a good example:

To change state, DisableCurrentStep() and ActivateStep() are called in sequence.
DisableCurrentStep() executes whatever code is needed, before the state transition occurs:

ActivateStep() executes whatever code needed, before the new state is reached:

By design, there is a parameter for the transition to state 4 (Wait for device detection): the action code, which is one of these possible values defined as constants:

This way, when the device arrival/removal event sent by the AxDeviceDetector ActiveX server arrives, we know what action to take. The action code is stored in the miAction form scoped variable. Here an example, in the code executed when we click on the cmdRegisterDevice() button:

In the moDetector_OnDeviceArrival() there is a “Select Case miAction” statement that can now execute the correct action and transition back to step (ie state) #1 (select employee).

A class of state transitions is missing

Guessed which one ? – If not, then try to select another employee while you’re in state #3. You can’t.

There are at least two “cancel” actions that are missing:

  • A “cancel” button is missing, along the “OK” button
  • Another “cancel” button is missing if we want to cancel the wait for the device arrival.

Both of the action could for instance simply return us to state #1. I’ll let that for you.

About hashing

To link the USB device to an Employee, we generate a simple hash with the password, the device product id and the device serial number. That is discriminating enough to make it reasonably difficult to reverse engineer:

I’ve used here one of the first MD5 functions for VBA I googled for. You can replace that with a SHA256 to make it stronger, and add your additional parameters to generate the hash.

You should however avoid displaying and/or logging too much information, as I do for the sake of the demo, like the device product id, in a production application.

Avoiding blocking UI while handling ActiveX server events

We should avoid displaying any blocking UI elements like message boxes, input boxes or modal dialog boxes. This would prevent the server from notifying us of further events while those UI elements block the application’s VBA thread. The ActiveX server itself runs on another process/thread, so it will not block or crash. It has room for buffering 50 messages, that’s its event queue default size. When the server event queue becomes full, it’s emptied and the cleared events are lost.

About the Employee table

  • We have to store the device serial number (field AuthDeviceSerial), if we want to avoid registering an already registered device to another user;
  • The EventLog table is just here for the demo. You may want to (re)use your own logging system.
  • You should of course split your database, ie do not keep the data tables in the same database as the application, but rather link them.

That’s it !

Downloads

All the material including the DeviceDetectorAuthDemo.accdb database, is available in my AxDeviceDetector github project

 

Credits: Post photo by Kentaro Toma on Unsplash

Explaining how to detect device arrival/removal in an ActiveX server in real time

We’ve seen know how to get some useful information (like vendor ID and serial number) from attached Windows devices, using a C DLL. Interfacing the DLL with Visual Basic, was then just a matter of declaring the DLL functions in a standard module. With some more basic wrapping, we’ve obtained a simple API function and a data structure, to query for device information in Access.

Reacting to real time device arrival/removal notifications in VB/A

Getting notified in real time (well, as near as that as we can get, at least), when a USB device is plugged or unplugged from the system is the next step. This will allow us later to do things like build an Access application that can use a USB device as a multifactor authentication device (MFA).

Receiving notifications about device arrival or removal in windows is achieved by registering for that, via the RegisterDeviceNotification() Win32 API function. By then listening for the WM_DEVICECHANGE windows message, we can react accordingly.

Subclassing

VB/A cannot listen for windows messages on its own. We have to use a “subclassing” technique, where we redirect all the messages Windows sends to a VB Form (which itself is a window), to a custom routine. I use a subclassing technique derived from “Sensei” Bruce McKinney’s “Hardcore VB 5” book (free link), where we end up responding to the Windows messages we choose by implementing an ISubclass interface, in a Visual Basic Form. Here’s the snippet in frmMain that handles what we need:

Building an out-of-process ActiveX exe server

An ActiveX out-of-process server written in (classic) Visual Basic, is an independent 32 bits executable that serves objects running on their own threads, in an external process. In VBA hosts like Access or Excel, the Visual Basic code runs in a single thread of execution, so it cannot handle simultaneously the tasks of continuously responding to Windows messages and running the VBA application. I already explained why it would be a bad idea to use subclassing techniques in a VBA host application, talking about how to detect Windows session lock/unlock in VBA.

Object classes that are exposed by an ActiveX OOP server have the additional advantage of being accessible in 32 and 64 bits VBA hosts, thanks to (automatic) COM inter process communication marshalling.

I’ve used this technique in the three ActiveX servers I’m presenting on this series on my blog:

  • AxSessionLockDetector

    • Provides classes and events to detect and react when a user locks or unlocks his/her Windows session.
  • AxDeviceDetector

    • Provides classes and events to detect then there’s a device (or media) arrival or removal from the system (The one we’re discussing here).
  • AxHttpSrv

    • Provides classes and events to implement an HTTP web server in Microsoft Access.
    • Although this is the last one of this series, it is the oldest (I wrote it back in 2011), and it served as a base template for the two others.

However, when writing the ActiveX server we’re discussing here, I ran into a quite unexpected problem.

While trying to raise events or invoking OLE callbacks to notify the host for events, automation errors where popping up, instead of raising events (or executing OLE callbacks), basically rendering the solution useless. The first actionable information I found about that by googling around is in this KB article, located in a quite interesting github repository (it’s a static repository of Microsoft knowledge base articles).

A nice workaround consists in queuing the Windows messages we receive in the function implementing the ISubclass interface, and notifying them a bit later, using a timer to pop them back from the queue.

So, I quickly added a fixed size queue, stored in a circular memory array, specializing the CQueue class I presented in an earlier article where I was illustrating how a circular queue implementation works.

Strangely, I did not have to use this queue/timer technique for the two other ActiveX EXE server, but, alas, I can’t explain why is the difference between all of them.

Events or OLE callbacks ?

For an ActiveX server (or DLL for that matter), we can use two techniques to notify an object instance owner:

  • Events

    • Technically, they’re disguised OLE callbacks. You first declare the event in a class definition module, then raise the event with the RaiseEvent VB keyword;
  • OLE Callbacks

    • The client process has to give a reference to an (untyped) Object, to the ActiveX server class, that will keep a reference on it.
    • The ActiveX server class then tries a late bound call on a specific method name on the object reference, that’s the late bound call or OLE callback.
    • One key difference with raising events is that you control the callback sequence in the ActiveX server, when you chain multiple OLE callbacks; when you raise events, you can’t control in what sequence the client receives them.

The DeviceDetector class in the AxDeviceDetector ActiveX server allows for both methods to be used. For each of the two methods, you also have the option of passing parameters to the event procedures or OLE callbacks or not. If not, then the client can call methods on the class to get them. This is commented in the class source code:

 

Then

You should see a schema illustrating a bit more nicely how this ActiveX Server works, as the image of this post.
Head to the github repository to download all the related material.

In the next post, we’ll see a real life possible use case for this ActiveX server.
I’ll explain the Access Demo application that you can see here in the mean time:

Using USB device for multifactor authentication demo

Downloads

Browse the source code for more specific comments and explanations.