A Generic Substring Function

You may not remember the PL/I programming language (that’s a Roman numeral 1, not an uppercase i), but I do. Well, not a lot. What I do remember is that the code looked like the programmer. That is, it looked like whatever programming language the programmer already knew. While this is true of all programming languages to some degree, PL/I was designed with this in mind.

But, back to substrings. One substring of abcde is bcd. Depending on the language you’re programming in, the substring function or method is written differently. As a function, it typically looks like substr(string,start,end). As a method, it typically looks like string.substr(start,end). The differences are in the name of the function or method and in what start and end mean.

Let’s take Excel, for instance. If you’re writing a formula, the substring function is MID (for middle): MID(string,start,length). In this case, start is the starting position and length is the number of characters to return. MID("abcde",2,3) returns "bcd". If you’re writing macros in Excel (VBA), it’s the same function, Mid, which is nice.

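COBOL is similar, though standard COBOL uses a notation called reference modification rather than a function: STRING-VAR(START:LENGTH). START is a position and LENGTH is the number of characters, just as in Excel. The only difference is how it’s written.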

JavaScript uses the method string.substr(start,length). The difference here is that start is an index rather than a position. JavaScript is zero-based; Excel, VBA, and the like are one-based. Therefore, the index is one less than the position.
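To make the position-versus-index difference concrete, here is a minimal sketch. The mid helper is a hypothetical name of my own, not something JavaScript provides, and it assumes the position passed in is 1 or greater.

```javascript
// Hypothetical helper (not built into JavaScript): behaves like Excel/VBA
// MID(string, start, length), where start is a 1-based position.
// Assumes position >= 1 and length >= 0.
function mid(str, position, length) {
  const index = position - 1;              // 1-based position -> 0-based index
  return str.slice(index, index + length); // slice takes (start index, end index)
}

console.log(mid("abcde", 2, 3));     // "bcd" (position 2)
console.log("abcde".substr(1, 3));   // "bcd" (index 1, which is position 2)
```

Both calls return the same three characters. The only thing that changes is whether you count the b as character 2 or character 1.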

According to Mozilla (MDN), substr is deprecated in favor of substring and slice. W3Schools just says substr is part of JavaScript and is supported by all major browsers. Can I use agrees with W3Schools. And I’m not finding anything in the JavaScript (ECMAScript) documentation that calls it deprecated, though the specification does file substr under Annex B, its section for legacy web-browser features. Perhaps Mozilla is planning to drop support for it in their Firefox browser. I figure as long as there are plenty of substr calls out there, they and all the other browsers will keep supporting it. So, I continue to use it.

But what about the substring and slice methods? Python doesn’t even provide a substring function, only slicing. How are they different?

The substring and slice methods use the same format, other than the name: string.substring(start,end) and string.slice(start,end). You can get fancy and use negative values for start and end. If you do, substring and slice work differently: slice counts a negative value back from the end of the string, while substring treats it as 0. I suggest that you not do that, because whoever has to maintain your code may get confused. In my case, that would be me.
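If you want to see where the two methods part ways, here is a small sketch. The compare function is just a hypothetical helper for printing both results side by side.

```javascript
// Hypothetical helper: run slice and substring with the same arguments.
function compare(str, start, end) {
  return { slice: str.slice(start, end), substring: str.substring(start, end) };
}

// Ordinary arguments: the two agree.
console.log(compare("abcde", 1, 4));   // { slice: 'bcd', substring: 'bcd' }

// Negative arguments: slice counts back from the end, substring treats them as 0.
console.log(compare("abcde", -3, -1)); // { slice: 'cd', substring: '' }

// start greater than end: slice returns nothing, substring swaps the two.
console.log(compare("abcde", 3, 1));   // { slice: '', substring: 'bc' }
```

As long as start and end are non-negative and in order, it doesn’t matter which one you pick.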

So, instead of specifying the length or number of characters to return, substring and slice specify the end index. However, they make it (IMHO) complicated. Start and end both refer to indexes. However, start is inclusive and end is exclusive, as the Python documentation puts it. What that means is that the start index is where the substring starts, and the end index is one character after where the substring ends. A side effect is that the length of the result is simply end minus start. If you weren’t confused before, you probably are now :(.
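Here is a small sketch of the inclusive-start, exclusive-end rule and of how it relates to the start-and-length style. The two helper names are hypothetical, made up for this example.

```javascript
// Hypothetical helpers for switching between (start, length) and (start, end).
function endFromLength(start, length) { return start + length; } // exclusive end
function lengthFromEnd(start, end) { return end - start; }

const word = "abcde";
console.log(word.slice(1, 4));     // "bcd"
console.log(word[1]);              // "b": the start index is included
console.log(word[4]);              // "e": the end index is NOT included
console.log(lengthFromEnd(1, 4));  // 3, the number of characters returned

// Same substring, starting from a length of 3 instead of an end index.
console.log(word.slice(1, endFromLength(1, 3))); // "bcd"
```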

All the programming languages that I researched (and I researched about 15) use one of those syntaxes to return a substring, though they might call the function or method something different. Lisp calls it subseq and uses the slice syntax.

So, how do you make one substring function that lets programmers use the syntax they’re used to? I think that’s important, because getting the parameters wrong can cause havoc. Trust me on that one.

I added an x_substr function to my x_press.js library as an example of a generic substring function in JavaScript. First, you have to work in the language that you’re programming in. For JavaScript, that means start and end have to be converted to indexes if they’re not passed that way.

You also have to know whether the programmer is passing positions or indexes (or lengths). I determine this with parameters: x_substr(string,start,end,start_type,end_type). start_type and end_type are arrays. The first element of the array is pos, ind/idx, or str, for position, index, or string. Notice that I introduced a new way of returning a substring: identifying its boundaries with strings. For end_type, the first element can also be len, for length. The second element is bef, at/on, or aft, for before, at/on, or after. VBA syntax is at (or on) for start. Slice syntax is at/on for start and aft for end. Perhaps I should add sli for slice. For len, the second element should be '' (an empty string). In addition to adding str for either start or end, I also added bef. I figure that to be fully generic, if there’s an aft, there must be a bef.

x_substr(string,start,end,['pos','at'],['len',''])

x_substr(string,start,end,['ind','at'],['ind','aft'])
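To show how those parameters could fit together, here is a rough sketch. It is not the actual x_press.js code, only one possible reading of the description above. The locate helper is a hypothetical name, and I am assuming that bef, at, and aft say where the marker you pass sits relative to the first (or last) character of the result. That reading matches slice, where the end index sits after the last character. To keep it short, the sketch finishes with slice.

```javascript
// Sketch only: one possible reading of the parameters described above,
// NOT the actual x_press.js code.
// Assumption: 'bef' / 'at' / 'aft' say where the marker you pass sits
// relative to the first (or last) character of the result.

// Hypothetical helper: the 0-based index range [first, last] that a marker
// covers, or null if a string marker is not found.
function locate(string, value, kind, from) {
  if (kind === 'pos') return [value - 1, value - 1];            // 1-based position
  if (kind === 'ind' || kind === 'idx') return [value, value];  // 0-based index
  if (kind === 'str') {
    const at = string.indexOf(value, from);
    return at < 0 ? null : [at, at + value.length - 1];
  }
  throw new Error('unknown type: ' + kind);
}

function x_substr(string, start, end, start_type, end_type) {
  const [startKind, startRel] = start_type;
  const [endKind, endRel] = end_type;

  // Work out the index of the first character to keep.
  const s = locate(string, start, startKind, 0);
  if (s === null) return '';
  let first;
  if (startRel === 'bef') first = s[1] + 1;       // marker is before the result
  else if (startRel === 'aft') first = s[0] - 1;  // marker is after the first character
  else first = s[0];                              // 'at' or 'on'

  // Work out the index of the last character to keep (inclusive).
  let last;
  if (endKind === 'len') {
    last = first + end - 1;                       // end is a character count
  } else {
    const e = locate(string, end, endKind, Math.max(first, 0));
    if (e === null) return '';
    if (endRel === 'aft') last = e[0] - 1;        // marker is after the result (slice style)
    else if (endRel === 'bef') last = e[1] + 1;   // marker is before the last character
    else last = e[1];                             // 'at' or 'on'
  }

  // Clamp to the string and extract.
  first = Math.max(first, 0);
  last = Math.min(last, string.length - 1);
  return first > last ? '' : string.slice(first, last + 1);
}

console.log(x_substr('abcde', 2, 3, ['pos', 'at'], ['len', '']));             // "bcd"
console.log(x_substr('abcde', 1, 4, ['ind', 'at'], ['ind', 'aft']));          // "bcd"
console.log(x_substr('name=value;', '=', ';', ['str', 'bef'], ['str', 'aft'])); // "value"
```

The last call shows what the str type buys you under this reading: with the start marker before the result and the end marker after it, you get the text between two delimiters without counting anything.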

After I determine what the start and end indexes actually are, I have to extract the substring and return it. I used a simple for loop to accomplish this. Using slice may have been slightly faster, but not enough to matter. And besides, I understand the for loop, sort of.
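For the curious, the for loop version of that last step might look something like this. It is a sketch, not the code from x_press.js. The extract name is hypothetical, and I am assuming first and last are both inclusive indexes by this point.

```javascript
// Sketch only (hypothetical helper): copy the characters from index first
// through index last, both inclusive, one at a time.
function extract(string, first, last) {
  let result = '';
  const begin = Math.max(first, 0);                 // don't start before the string
  const stop = Math.min(last, string.length - 1);   // don't run past the end
  for (let i = begin; i <= stop; i++) {
    result += string[i];
  }
  return result;
}

console.log(extract('abcde', 1, 3));                              // "bcd"
console.log(extract('abcde', 1, 3) === 'abcde'.slice(1, 3 + 1));  // true
```

The slice call in the last line does the same job; it just needs last + 1 because its end index is exclusive.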