We have a service that returns XML. The XML isn't structured the way you'd expect, i.e. with a parent node representing a record and child nodes representing the fields of that record. To be clear, it isn't like this:
<links>
  <linkData>
    <link>mylinkhere</link>
    <linkText>mylink text here</linkText>
  </linkData>
  <linkData>
    <link>myotherlinkhere</link>
    <linkText>myotherlink text here</linkText>
  </linkData>
  ...etc
</links>
Instead it concatenates the contents of each field across records: the two "link" values go into one node and the two "linkText" values go into another. It then adds a "length" node listing how long each entry is. It ends up looking like this:
<linkdata>
  <link>
    <length>10,15</length>
    <text>mylinkheremyotherlinkhere</text>
  </link>
  <linkText>
    <length>16,21</length>
    <text>mylink text heremyotherlink text here</text>
  </linkText>
</linkdata>
I assumed the length represented the character count, and based on that assumption I used the ColdFusion Mid() function.
It turns out the length actually represents the byte length, so whenever an entry contained multibyte characters, Mid() extracted too many characters and every later entry started at the wrong offset.
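To make the mismatch concrete, here is a small sketch in plain Java, the language the eventual fix borrows from. It assumes that CF's Len() and Mid() count characters the same way Java's String.length() does. The class name LengthMismatchDemo and the euro-sign sample string are made up for illustration and are not data from the service.

```java
import java.nio.charset.StandardCharsets;

class LengthMismatchDemo {
    // Number of UTF-16 code units in the string (assumed to match what Len() reports).
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "normal.pdf";   // ASCII only: one byte per character
        String mixed = "\u20AC.pdf";   // euro sign takes three bytes in UTF-8

        // prints "10 chars, 10 bytes"
        System.out.println(charCount(ascii) + " chars, " + utf8ByteCount(ascii) + " bytes");
        // prints "5 chars, 7 bytes"
        System.out.println(charCount(mixed) + " chars, " + utf8ByteCount(mixed) + " bytes");
    }
}
```

For ASCII-only entries the two counts agree, which is why a character-based approach can appear to work until the first multibyte character turns up.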
Here is a code sample using the Mid() approach (I have represented the XML as a struct):

<cfset str = {}>
<cfset str["links"] = {lengths="53,28,25",text="http://my.url.com/with multibyte chars _???.pdfhttp://my.url.com/normal.pdfhttp://my.url.com/?.pdf",colour="red"}>
<cfset str["linkText"] = {lengths="54,22,22",text="A link to my first multibyte url look it has ???A link to a normal urlOne more link with ?",colour="blue"}>

<cfdump var="#str#">

<cfset numberOfRecords = 3>

<!--- Show the lengths of the urls and linkText strings. --->
<cfoutput>
  <p><strong>Extract the url strings and the link text strings assuming each character is a single byte character</strong></p>
  <p><cfloop collection="#str#" item="elem">
    <cfset positionStart = 1>
    <strong>#elem#</strong><br />
    (length of string = #len(str[elem].text)#)<br />
    (sum of reported lengths = #ArraySum(ListToArray(str[elem].lengths))#)<br />
    <cfloop from="1" to="#numberOfRecords#" index="i">
      <!--- Mid() counts characters, so a byte length overshoots on multibyte text. --->
      <cfset u = mid(str[elem].text, positionStart, ListGetAt(str[elem].lengths,i))>
      <cfset positionStart = positionStart + ListGetAt(str[elem].lengths,i)>
      <span style="color:#str[elem].colour#;">#u#</span><br />
    </cfloop>
  </cfloop></p>
</cfoutput>
The question marks '?' are supposed to represent multibyte characters. Unfortunately, something in this environment is not retaining them.
Anyway, I found that converting the string to bytes with getBytes("UTF-8"), and then calling the copyOfRange() method of Java's java.util.Arrays, let me extract exactly the bytes I required. CF's CharsetEncode() with "utf-8" then converts those bytes back into a string. See below for the revised version.
<cfoutput>
  <p><strong>Extract the url strings and the link text strings</strong></p>
  <cfset arr = CreateObject("java","java.util.Arrays")>
  <p><cfloop collection="#str#" item="elem">
    <cfset positionStart = 0>
    <cfset positionSum = 0>
    <strong>#elem#</strong><br />
    <!--- Work on the UTF-8 bytes, because the reported lengths are byte counts. --->
    <cfset b = str[elem].text.getBytes("utf-8")>
    (length of byte array = #len(b)#)<br />
    (sum of reported lengths = #ArraySum(ListToArray(str[elem].lengths))#)<br />
    <cfloop from="1" to="#numberOfRecords#" index="i">
      <!--- copyOfRange() takes a zero-based start (inclusive) and end (exclusive). --->
      <cfset positionStart = positionSum>
      <cfset positionSum = positionSum + ListGetAt(str[elem].lengths,i)>
      <cfset ua = arr.copyOfRange(b, positionStart, positionSum)>
      <cfset u = CharsetEncode(ua,'utf-8')>
      <span style="color:#str[elem].colour#;">#u#</span><br />
    </cfloop>
  </cfloop></p>
</cfoutput>
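For anyone who wants to try the same byte-slicing idea outside ColdFusion, here is a minimal sketch in plain Java. It is not the service's own code: the class name ByteLengthSplitter, the split() helper and the sample string are hypothetical, and it assumes the text is UTF-8 and the lengths arrive as a comma-separated list of byte counts.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteLengthSplitter {
    // Splits `text` into entries whose UTF-8 byte lengths are listed in `lengths`
    // (comma-separated, for example "5,5,2").
    static List<String> split(String text, String lengths) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (String token : lengths.split(",")) {
            int end = start + Integer.parseInt(token.trim());
            // copyOfRange() pads with zero bytes past the end, so check explicitly.
            if (end > bytes.length) {
                throw new IllegalArgumentException("Reported lengths exceed the byte array");
            }
            // start is inclusive, end is exclusive.
            byte[] slice = Arrays.copyOfRange(bytes, start, end);
            parts.add(new String(slice, StandardCharsets.UTF_8));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        // "a" + euro sign + "b" is 5 bytes, "plain" is 5 bytes, e-acute is 2 bytes.
        String text = "a\u20ACb" + "plain" + "\u00E9";
        for (String part : split(text, "5,5,2")) {
            System.out.println(part);
        }
    }
}
```

The explicit bounds check is a design choice of this sketch: without it, a length list that overshoots the data would silently produce entries padded with zero bytes rather than an error.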
Here is a link to the sample.

