Need Help developing an Algorithm to Decode MathML (XML)



Saladsamurai
08-10-2009, 06:34 PM
First, I will start by saying that I am NOT looking for someone to write my code for me. I would just like some guidance/brainstorming to see how different folks might attack this. I have included what I have so far below.

Second, yes, I know it is going to be a pain, but I need to do it for work.

Here is an example:

The quadratic formula, x = ( -b + (b^2-4*a*c)^(1/2))/(2*a) would be represented in MathML something like:


<math>
<mrow>
<mi>x</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mo form="prefix">&minus;</mo>
<mi>b</mi>
<mo>&PlusMinus;</mo>
<msqrt>
<msup>
<mi>b</mi>
<mn>2</mn>
</msup>
<mo>&minus;</mo>
<mn>4</mn>
<mo>&InvisibleTimes;</mo>
<mi>a</mi>
<mo>&InvisibleTimes;</mo>
<mi>c</mi>
</msqrt>
</mrow>
<mrow>
<mn>2</mn>
<mo>&InvisibleTimes;</mo>
<mi>a</mi>
</mrow>
</mfrac>
</mrow>
</math>

I need to take the above code and turn it back into a regular (calculator-type) expression like x = ( -b + (b^2-4*a*c)^(1/2))/(2*a).

Now what I (with help) have accomplished so far is to have all of the code read into VBA and inserted into an array such that each element of the MathML is an element of the array.

For example:
Array(1) = <mrow>
Array(2) = <mi>
Array(3) = x
Array(4) = <mi>

and soooo on...

Okay, let's add in some of these elements' definitions:

mrow—displays its subelements in a horizontal row.
mi—represents an identifier such as the name of a function or variable.
mo—represents an operator or delimiter.
mn—represents a number.

My idea at this point is as follows:

The tags that get wrapped around each term are analogous to sets of parentheses, so I will describe my idea in terms of this analogy.

My idea is based on the distance between a left and a right parenthesis. When working with a simple arithmetic expression with multiple sets of parentheses, one could note the location of the 'innermost' left parenthesis and then take its corresponding right parenthesis to be the 'closest' right parenthesis to it.

Then, backing out from the center of the expression, the next left parenthesis would be matched to the 2nd closest right parenthesis, and so on.

This analogy could be extended to the MathML by noting that there are now different 'flavors' of parentheses.

I know there will be some difficulties that arise. For one, how do I actually determine what should be the 'innermost' set?

Secondly, what if there is a 'lone pair' somewhere, for example ((()))+() ?
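The parenthesis analogy above is usually handled with a stack rather than by measuring distances: each closing tag pairs with the most recent opening tag that is still unmatched, so there is no need to locate the 'innermost' set first, and a 'lone pair' such as the trailing () falls out naturally. Below is a minimal VBA sketch of that idea. The function name MatchTags is made up for illustration, and it assumes the array already holds proper closing tags (e.g. </mi>) and no self-closing tags.

```vba
Option Explicit

' Hypothetical helper: returns an array in which element i holds the index of
' the tag that pairs with tokens(i), or -1 when tokens(i) is plain text.
Function MatchTags(tokens As Variant) As Variant
    Dim partner() As Long, stack() As Long
    Dim i As Long, top As Long, tagName As String

    ReDim partner(LBound(tokens) To UBound(tokens))
    ReDim stack(1 To UBound(tokens) - LBound(tokens) + 1)

    For i = LBound(tokens) To UBound(tokens)
        partner(i) = -1
        If Left$(tokens(i), 2) = "</" Then
            ' Closing tag: it pairs with the most recent unmatched opening tag.
            tagName = Mid$(tokens(i), 3, Len(tokens(i)) - 3)
            If top = 0 Then Err.Raise vbObjectError + 513, , "No opening tag for " & tokens(i)
            ' The 'flavor' check: <mrow ...> may only be closed by </mrow>.
            If Not (tokens(stack(top)) Like "<" & tagName & "[ >]*") Then
                Err.Raise vbObjectError + 514, , tokens(stack(top)) & " closed by " & tokens(i)
            End If
            partner(i) = stack(top)
            partner(stack(top)) = i
            top = top - 1
        ElseIf Left$(tokens(i), 1) = "<" Then
            ' Opening tag: remember its position until its partner shows up.
            top = top + 1
            stack(top) = i
        End If
    Next i

    If top > 0 Then Err.Raise vbObjectError + 515, , "No closing tag for " & tokens(stack(top))
    MatchTags = partner
End Function

Sub DemoMatchTags()
    Dim tokens As Variant, p As Variant, i As Long
    ' Tag analogue of ((()))+()
    tokens = Array("<mrow>", "<mrow>", "<mrow>", "</mrow>", "</mrow>", "</mrow>", _
                   "<mo>", "+", "</mo>", "<mrow>", "</mrow>")
    p = MatchTags(tokens)
    For i = LBound(tokens) To UBound(tokens)
        Debug.Print i, tokens(i), p(i)
    Next i
End Sub
```

Running DemoMatchTags lists each token next to the index of its partner, which is enough to walk the structure from the outside in.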

Even the most "trivial" of suggestions would be appreciated.

Thanks!!

JimmyTheHand
08-10-2009, 11:51 PM
Hi

I didn't follow your lead, but started down another path instead. I'll describe my thinking and you decide whether or not it is useful.
But first, a little correction. You say you can import the MathML code into a string array like this:
Array(1) = <mrow>
Array(2) = <mi>
Array(3) = x
Array(4) = <mi>
Maybe it's only a typo, but if not, then the importing code should be modified so that Array(4) = </mi> . It would be so much easier to locate the corresponding closing part of a tag this way.

Now, about the algorithm. I would look for certain operators in the MathML code. For example, it seems division is represented by <mfrac>. The expression a/b would look like

<mfrac>
<mi>
a
</mi>
<mi>
b
</mi>
</mfrac>

Within an <mfrac> tag there should be two pairs of tags on the same level: one contains the dividend, the other the divisor. So, when an <mfrac> tag is found in the array, take the tags on the next level and replace them with ( and ). Between the two groups, a / operator should be placed in the resulting expression. Like this:

<mfrac>
(
a
)
/
(
b
)
</mfrac>

Then any pairs of parentheses that don't have at least one operator between them can be removed.
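The clean-up step described above (dropping parentheses that enclose no operator) could be sketched in VBA as follows. This is only an illustration: the name StripAtomParens is made up, it relies on the Windows-only VBScript.RegExp object, and it assumes the expression contains no function calls such as sin(x), whose parentheses it would also remove.

```vba
Option Explicit

' Hypothetical helper: removes parentheses that enclose a single name or
' number, e.g. "(a)/(b)" becomes "a/b" while "(a+1)/(b)" becomes "(a+1)/b".
Function StripAtomParens(ByVal expr As String) As String
    Dim re As Object, prev As String
    Set re = CreateObject("VBScript.RegExp")   ' late-bound, Windows only
    re.Global = True
    re.Pattern = "\(([0-9A-Za-z.]+)\)"         ' "(" letters/digits/dots ")"
    ' Repeat until nothing changes so that nested cases like "((a))" collapse too.
    Do
        prev = expr
        expr = re.Replace(expr, "$1")
    Loop Until expr = prev
    StripAtomParens = expr
End Function
```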

The same logic goes for raising to powers. I guess it will also work for multiplication, but your example doesn't show it. For clarification, please tell me what the following expression would look like in MathML:

(a+1)*(b-1)

Actually, without this info my whole "algorithm" so far is based on hope only.

Jimmy

Bob Phillips
08-11-2009, 12:29 AM
First thought: this looks tailor-made for recursive code.
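To make the recursion suggestion concrete, here is a hedged VBA sketch of a recursive converter. It is not anyone's posted solution. All names (MathMLToInfix, NodeToInfix and the private helpers) are invented for illustration. It assumes well-formed input with a single root element, no self-closing tags, and only the elements and entities that appear in this thread, and it maps &PlusMinus; to a plain "+" as a simplification.

```vba
' Entry point (hypothetical name): MathML text in, calculator-style text out.
Function MathMLToInfix(ByVal src As String) As String
    Dim tokens() As String, pos As Long, s As String
    tokens = Tokenize(src)
    pos = 1
    s = NodeToInfix(tokens, pos)
    If IsWrapped(s) Then s = Mid$(s, 2, Len(s) - 2)   ' drop a redundant outer pair
    MathMLToInfix = s
End Function

' Splits the source into a 1-based array of tags and text, dropping whitespace.
Private Function Tokenize(ByVal src As String) As String()
    Dim parts() As String, toks() As String, i As Long, n As Long, t As String
    src = Replace(Replace(Replace(src, vbCr, ""), vbLf, ""), vbTab, "")
    src = Replace(Replace(src, "<", Chr$(1) & "<"), ">", ">" & Chr$(1))
    parts = Split(src, Chr$(1))
    ReDim toks(1 To UBound(parts) + 2)
    For i = 0 To UBound(parts)
        t = Trim$(parts(i))
        If Len(t) > 0 Then
            n = n + 1
            toks(n) = t
        End If
    Next i
    ReDim Preserve toks(1 To n)   ' assumes the input held at least one token
    Tokenize = toks
End Function

' Converts the element that starts at tokens(pos) and leaves pos just past its
' closing tag. Containers call this function again for each child element.
Private Function NodeToInfix(tokens() As String, ByRef pos As Long) As String
    Dim tagName As String, kids() As String, n As Long, body As String
    tagName = Mid$(tokens(pos), 2, Len(tokens(pos)) - 2)   ' strip "<" and ">"
    If InStr(tagName, " ") > 0 Then tagName = Left$(tagName, InStr(tagName, " ") - 1)
    pos = pos + 1
    Select Case tagName
        Case "mi", "mn", "mo"                ' leaf: collect text up to the closing tag
            Do While Left$(tokens(pos), 2) <> "</"
                body = body & tokens(pos)
                pos = pos + 1
            Loop
            NodeToInfix = MapEntity(body)
        Case Else                            ' container: recurse into each child
            Do While Left$(tokens(pos), 2) <> "</"
                n = n + 1
                ReDim Preserve kids(1 To n)
                kids(n) = NodeToInfix(tokens, pos)
            Loop
            NodeToInfix = Combine(tagName, kids, n)
    End Select
    pos = pos + 1                            ' step over the closing tag
End Function

' Translates the few operator entities used in this thread.
Private Function MapEntity(ByVal s As String) As String
    Select Case s
        Case "&minus;": MapEntity = "-"
        Case "&PlusMinus;": MapEntity = "+"  ' simplification: keeps only the "+" root
        Case "&InvisibleTimes;": MapEntity = "*"
        Case Else: MapEntity = s
    End Select
End Function

' Builds the text for a container from the text of its n children.
Private Function Combine(tagName As String, kids() As String, ByVal n As Long) As String
    Dim i As Long, joined As String
    For i = 1 To n
        joined = joined & kids(i)
    Next i
    Select Case tagName
        Case "mfrac", "msup"
            If n <> 2 Then Err.Raise vbObjectError + 513, , "<" & tagName & "> needs two children"
            Combine = Wrap(kids(1)) & IIf(tagName = "mfrac", "/", "^") & Wrap(kids(2))
        Case "msqrt": Combine = Wrap(joined) & "^(1/2)"
        Case "mrow": Combine = IIf(n > 1, Wrap(joined), joined)
        Case Else: Combine = joined
    End Select
End Function

' Adds parentheses unless s is a bare name/number or is already fully enclosed.
Private Function Wrap(ByVal s As String) As String
    If IsWrapped(s) Or Not (s Like "*[!0-9A-Za-z.]*") Then Wrap = s Else Wrap = "(" & s & ")"
End Function

' True when the first "(" is closed only by the final ")".
Private Function IsWrapped(ByVal s As String) As Boolean
    Dim i As Long, depth As Long
    If Left$(s, 1) <> "(" Or Right$(s, 1) <> ")" Then Exit Function
    For i = 1 To Len(s) - 1
        If Mid$(s, i, 1) = "(" Then depth = depth + 1
        If Mid$(s, i, 1) = ")" Then depth = depth - 1
        If depth = 0 Then Exit Function
    Next i
    IsWrapped = True
End Function
```

With this sketch, MathMLToInfix("<mfrac><mi>a</mi><mi>b</mi></mfrac>") returns a/b. The point of the recursion is that each element only needs to know how to combine the finished text of its own children, so nesting depth takes care of itself.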

Saladsamurai
08-11-2009, 12:06 PM
Hi Jimmy

(a+1)*(b-1)
I believe that the above would look something like:


<mrow>
<mi>a</mi>
<mo>+</mo>
<mn>1</mn>
</mrow>
<mo>*</mo>
<mrow>
<mi>b</mi>
<mo>-</mo>
<mn>1</mn>
</mrow>

where:

mrow—displays its subelements in a horizontal row.
mi—represents an identifier such as the name of a function or variable.
mo—represents an operator or delimiter.
mn—represents a number.

macropod
08-11-2009, 06:37 PM
Hi Saladsamurai,

Based on what you posted, here's a VBA function for parsing your MathML code:
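Function GetFormula(ByVal InputStream As String) As String
    Dim i As Long, ArStr As Variant, StrExp As String
    ' Strip spaces and line breaks so that every token is either a tag or text.
    InputStream = Replace(InputStream, " ", "")
    InputStream = Replace(InputStream, vbCr, "")
    InputStream = Replace(InputStream, vbLf, "")
    ' Put a Chr(1) separator around every tag, collapse doubled separators, then split.
    InputStream = Replace(InputStream, "<", Chr(1) & "<")
    InputStream = Replace(InputStream, ">", ">" & Chr(1))
    InputStream = Replace(InputStream, ">" & Chr(1) & Chr(1) & "<", ">" & Chr(1) & "<")
    ArStr = Split(InputStream, Chr(1))
    ' The first and last elements are skipped: they are empty when the input
    ' starts and ends with a tag.
    For i = 1 To UBound(ArStr) - 1
        StrExp = ArStr(i)
        Select Case StrExp
            Case "</mrow>"
                ' Two adjacent rows are taken to be the numerator and
                ' denominator of an <mfrac>.
                If ArStr(i + 1) = "<mrow>" Then
                    ArStr(i) = ")/"
                Else
                    ArStr(i) = ")"
                End If
            Case "<mrow>": ArStr(i) = "("
            Case "&PlusMinus;": ArStr(i) = "+"
            Case "&minus;": ArStr(i) = "-"
            Case "&InvisibleTimes;": ArStr(i) = "*"
            Case "</msup>"
                ' The exponent text sits two tokens back, just before its own
                ' closing tag; this only handles single-token exponents.
                ArStr(i) = ""
                ArStr(i - 2) = "^" & ArStr(i - 2)
            Case "<msqrt>": ArStr(i) = "("
            Case "</msqrt>": ArStr(i) = ")^(1/2)"
            Case Else
                ' Any other tag carries no text of its own.
                If Left(StrExp, 1) = "<" Then ArStr(i) = ""
        End Select
    Next
    GetFormula = Join(ArStr, "")
End Function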

If you put the function and the MathML code into a Word document, then select the block of MathML code you want to process, you could extract the formula with:
Sub ParseMathML()
    MsgBox GetFormula(Selection)
End Sub

Saladsamurai
08-12-2009, 02:28 PM
Hey there macropod

Sorry, I am a little confused as to how I use this. Do I really use a "Word" document?

macropod
08-12-2009, 08:14 PM
Hi Saladsamurai,

The function should work in any app that supports VBA.

You need a process for capturing the MathML string to be parsed, which I assume you have.

Rather than writing code for the data capture, I opted for the quick & dirty approach of opening the MathML file or pasting the block of MathML code into a Word document, selecting the range to process, then running the 'ParseMathML' macro - which calls the function and passes the selected string to it.
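For the capture step outside Word, one possibility is to read the MathML file straight into a string. The sketch below is an assumption about how that might be done, not part of the posted solution: ReadMathMLFile is a made-up name, the file is assumed to be plain ANSI text, and line endings are normalised to vbCr because that is what a Word selection would contain.

```vba
Option Explicit

' Hypothetical helper: returns the whole of an ANSI text file as one string,
' with line endings normalised to vbCr (the paragraph mark Word uses).
Function ReadMathMLFile(ByVal filePath As String) As String
    Dim f As Integer, buf As String
    f = FreeFile
    Open filePath For Binary Access Read As #f
    buf = Space$(LOF(f))          ' size the buffer to the file length
    Get #f, , buf                 ' read the whole file in one go
    Close #f
    buf = Replace(buf, vbCrLf, vbCr)
    ReadMathMLFile = Replace(buf, vbLf, vbCr)
End Function

' Usage, assuming GetFormula from this thread is in the same project and
' "C:\temp\formula.mml" is a made-up path:
'     MsgBox GetFormula(ReadMathMLFile("C:\temp\formula.mml"))
```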