Editing Ctags to support UTF16

Exuberant Ctags is great when working on large projects in Vim. It reads your source code and creates a tags file that Vim can use to jump to function definitions from anywhere in the project. It is especially useful when you are new to a project and exploring code written by someone else. Unfortunately, it doesn’t support files encoded in UTF16. While I personally don’t use UTF16 for anything, the company I work for does. I wanted to use Vim as an editor for our TestComplete projects, but an early version of TestComplete used UTF16 for all the script files. TestComplete now supports other file types, but the whole project is already in UTF16 and no one sees a reason to change it. I decided that it would be worthwhile to edit the Ctags project and provide limited support for UTF16.

Basic usage:
This version of Exuberant Ctags supports little-endian UTF-16 (UCS-2LE) encoding.
To enable support for these files, pass the --UTF16 parameter to the ctags executable.
Ex: ctags --UTF16 -R .

Download:
While the work was done on Linux, I was able to compile it successfully on Windows using MinGW.

Code Changes:

The major code changes that I made are shown below. See the tar.gz archive for the final source code. Making the necessary changes was very straightforward, because all the functions used to read from the file were in read.c. Props to the Exuberant Ctags coders for a well-organized project.

While there are libraries that convert character sets, they tend to be large and I don’t need true UTF16 support. I’m not using any extended characters, so I decided that a naive solution was acceptable. The changes I made should work for files with or without a byte order mark (BOM).

First we must add code to detect if a file is UTF16 formatted or not. I added a filetype member to the sInputFile structure, then added this code to the fileOpen function in read.c:

File.filetype = ENC_ASCII;

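if(Option.UTF16Support == TRUE)
{
    /* Check for UTF16 format
     * This will either be in the form of a 0xff 0xfe BOM or
     * every other char will be 0x00
     */

    if(fread(c, 1, 8, File.fp) < 8) {
        /* Fewer than 8 bytes read: too short to check, leave as ASCII */
        rewind(File.fp);
    } else if((c[0] == (char)0xff) && (c[1] == (char)0xfe)) {
        /* BOM found: start reading just past it */
        File.filetype = ENC_UTF16;
        fseek(File.fp, 2, SEEK_SET);
    } else {
        /* No BOM: assume UTF16 unless an odd-numbered byte is non-zero */
        rewind(File.fp);
        File.filetype = ENC_UTF16;
        for(i = 1; i < 8; i+=2) {
            if(c[i] != 0x00) {
                File.filetype = ENC_ASCII;
                break;
            }
        }
    }
}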
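
To try the detection heuristic outside of ctags, the same check can be written against an in-memory buffer. This is only a minimal sketch of the idea described above: the helper name looks_like_utf16le is hypothetical and not part of the ctags source, and it takes unsigned char so the 0xff and 0xfe comparisons need no casts.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ctags.
 * Returns 1 if the first n bytes of buf look like little-endian UTF-16
 * holding ASCII-range text, 0 otherwise.
 */
int looks_like_utf16le(const unsigned char *buf, size_t n)
{
    size_t i;

    if (n < 2)
        return 0;                   /* too short to tell */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return 1;                   /* little-endian BOM */
    for (i = 1; i < n; i += 2) {
        if (buf[i] != 0x00)
            return 0;               /* non-zero high byte: treat as ASCII */
    }
    return 1;                       /* every other byte is 0x00 */
}
```

Like the check in fileOpen, this is a heuristic: a binary file that happens to have zero bytes in the odd positions of its first few bytes would also be reported as UTF16.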

Next I wrote replacement functions for the getc, ungetc, and fgets functions in the C library. These are the only functions ctags uses to read source code from disk. The replacements check whether we are reading from a UTF16 or an ASCII (UTF8) file and perform the requested operation accordingly. For a UTF16 file they read two bytes for every character requested and discard the empty (0x00) byte.

/*  This function replaces getc and performs utf16 to ascii conversion if necessary
 */

static int Getc(FILE * stream)
{
    int c = getc(stream);
    if(File.filetype == ENC_UTF16) {
        getc(stream);
    }
    return c;
}

/*  This function replaces ungetc and performs utf16 to ascii conversion if necessary
 */

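static int Ungetc(int c, FILE * stream)
{
    if(File.filetype == ENC_UTF16) {
        /* Push the 0x00 byte back first so that c is read before it.
         * Note: the C standard only guarantees one character of pushback,
         * so this relies on the C library accepting a second one.
         */
        ungetc(0x00, stream);
    }
    c = ungetc(c, stream);
    return c;
}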
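
The C standard only guarantees a single character of pushback from ungetc, so pushing back two bytes depends on the C library in use. One possible alternative, sketched here under the assumption that the stream is a seekable regular file opened in binary mode, is to move the file position back by two bytes with fseek instead. The names read_unit and unread_unit are hypothetical and not part of ctags; this is a different technique from the one used in the patch, shown only for comparison.

```c
#include <stdio.h>

/* Hypothetical helper: read one UTF-16LE code unit and return its
 * low byte, or EOF if a full two-byte unit is not available.
 */
int read_unit(FILE *fp)
{
    int lo = getc(fp);
    int hi = getc(fp);

    if (lo == EOF || hi == EOF)
        return EOF;
    return lo;
}

/* Hypothetical helper: step back over the unit that was just read.
 * Returns 0 on success, non-zero on failure (same as fseek).
 */
int unread_unit(FILE *fp)
{
    return fseek(fp, -2L, SEEK_CUR);
}
```

Unlike ungetc, this cannot push back a character that differs from the one that was read, and it will not work on pipes or other non-seekable streams.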

/*  This function replaces fgets and performs utf16 to ascii conversion if necessary
 */

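static char* Fgets(char* str, int num, FILE * stream)
{
    char *buf, *result;
    int i, j;

    if(stream == File.fp && File.filetype == ENC_UTF16) {
        /* Zero-filled scratch buffer: two bytes per requested char */
        buf = calloc(num * 2, sizeof(char));
        if(buf == NULL) {
            return NULL;
        }
        /* Read at most num - 1 characters (two bytes each) so that
         * str is always NUL-terminated after the copy below
         */
        result = fgets (buf, num * 2 - 1, stream);
        if(result == NULL) {
            free(buf);
            return NULL;
        }
        /* Keep the first byte of each pair and drop the 0x00 byte */
        for(i = 0, j = 0; i < num; i++, j+=2) {
            str[i] = result[j];
            if(str[i] == '\n') {
                /* fgets stopped at the newline's first byte;
                 * skip its 0x00 byte to keep the stream aligned
                 */
                getc(stream);
            }
        }
        result = str;
        free(buf);
    } else {
        result = fgets (str, num, stream);
    }
    DebugStatement (debugPrintf(DEBUG_FGETS, "%s\n", str));
    return result;
}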
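
The conversion that all three functions perform comes down to keeping the first byte of every two-byte pair. As a standalone illustration, here is a minimal sketch that does the same thing on a buffer in memory. The function name utf16le_to_ascii and its parameters are hypothetical and not part of ctags, and like the patch it assumes the text only contains ASCII-range characters.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ctags.
 * Narrow little-endian UTF-16 to a NUL-terminated ASCII string by
 * keeping the low byte of each two-byte unit. A leading BOM is skipped.
 * Returns the number of characters written, not counting the NUL.
 */
size_t utf16le_to_ascii(const unsigned char *in, size_t in_len,
                        char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    if (out_cap == 0)
        return 0;                       /* no room even for the NUL */
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;                          /* skip the BOM */
    for (; i + 1 < in_len && o + 1 < out_cap; i += 2)
        out[o++] = (char)in[i];         /* keep low byte, drop high byte */
    out[o] = '\0';
    return o;
}
```

Any character whose high byte is not 0x00 is silently mangled by this approach, which is why the option is described as naive support.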

I also added a --UTF16 command line parameter that enables the check for a UTF16 formatted file. To do this, I had to add it to the sOptionValues structure in options.h and add a corresponding default value in options.c. I added the option to the help message as shown below:

--UTF16 Enable naive UTF16 support (Does not support extended characters) [no]
