Thread: Need Help developing an Algorithm to Decode MathML (XML)

  1. #1

    First, I will start by saying that I am NOT looking for someone to write my code for me. I would just like some guidance/brainstorming to see how different folks might attack this. I have included what I have so far below.

    Second, yes, I know it is going to be a pain, but I need to do it for work.

    Here is an example:

    The quadratic formula, x = ( -b + (b^2-4*a*c)^(1/2))/(2*a) would be represented in MathML something like:

    <math>
      <mrow>
        <mi>x</mi>
        <mo>=</mo>
        <mfrac>
          <mrow>
            <mo form="prefix">&minus;</mo>
            <mi>b</mi>
            <mo>&PlusMinus;</mo>
            <msqrt>
              <msup>
                <mi>b</mi>
                <mn>2</mn>
              </msup>
              <mo>&minus;</mo>
              <mn>4</mn>
              <mo>&InvisibleTimes;</mo>
              <mi>a</mi>
              <mo>&InvisibleTimes;</mo>
              <mi>c</mi>
            </msqrt>
          </mrow>
          <mrow>
            <mn>2</mn>
            <mo>&InvisibleTimes;</mo>
            <mi>a</mi>
          </mrow>
        </mfrac>
      </mrow>
    </math>
    I need to take the above code and turn it back into a regular (calculator-type) expression like x = ( -b + (b^2-4*a*c)^(1/2))/(2*a) .

    What I (with help) have accomplished so far is to read all of the code into VBA and insert it into an array, such that each element of the MathML is an element of the array.

    For example:
    Array(1) = <mrow>
    Array(2) = <mi>
    Array(3) = x
    Array(4) = <mi>

    and so on...

    Okay, let's add some of these elements' definitions:

    mrow—displays its subelements in a horizontal row.
    mi—represents an identifier such as the name of a function or variable.
    mo—represents an operator or delimiter.
    mn—represents a number.

    My idea at this point is as follows:

    The tags that get wrapped around each term are analogous to sets of parentheses, so I will describe my idea in terms of this analogy.

    My idea is based on the distance between a left and right parenthesis. When working with a simple arithmetic expression with multiple sets of parentheses, one could denote the location of the 'innermost' left parenthesis and then denote its corresponding right parenthesis as the 'closest' right-parenthesis to it.

    Then, backing out from the center of the expression, the next left parenthesis would be matched to the 2nd closest right parenthesis, and so on.

    This analogy could be extended to MathML by noting that there are now different 'flavors' of parentheses.

    I know there will be some difficulties. For one, how do I actually determine what should be the 'innermost' set?

    Secondly, what if there is a 'lone pair' somewhere, for example ((()))+() ?

    Even the most "trivial" of suggestions would be appreciated.

    Thanks!!

  2. #2
    Hi

    I didn't follow your lead, but started down another path instead. I'll describe my thinking and you decide whether or not it is useful.
    But first, a little correction. You say you can import the MathML code into a string array like this:
    Array(1) = <mrow>
    Array(2) = <mi>
    Array(3) = x
    Array(4) = <mi>

    Maybe it's only a typo, but if not, then the importing code should be modified so that Array(4) = </mi>. It would be much easier to locate the corresponding closing part of a tag this way.
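
    Once the closing tags keep their slash, pairing each opening tag with its partner needs only a stack: push the index of every opening tag, and pop on every closing tag. This also covers the 'lone pair' case such as ((()))+(), because a closing tag always pairs with the most recently opened one. The sketch below is only an illustration; MatchTags is a made-up name, and it assumes a 0-based token array with no self-closing tags.

```vba
Option Explicit

' Hypothetical helper: for each token, returns the index of its partner tag
' (opening <-> closing), or -1 for text tokens and unmatched tags.
' Assumes closing tags start with "</" and that there are no self-closing tags.
Function MatchTags(Tokens As Variant) As Long()
    Dim Partner() As Long, Stack() As Long
    Dim Top As Long, i As Long, Tok As String

    ReDim Partner(LBound(Tokens) To UBound(Tokens))
    ReDim Stack(0 To UBound(Tokens) - LBound(Tokens))
    Top = -1                                ' -1 means the stack is empty

    For i = LBound(Tokens) To UBound(Tokens)
        Tok = Tokens(i)
        Partner(i) = -1
        If Left$(Tok, 2) = "</" Then
            ' A closing tag pairs with the most recently opened tag.
            If Top >= 0 Then
                Partner(i) = Stack(Top)
                Partner(Stack(Top)) = i
                Top = Top - 1
            End If
        ElseIf Left$(Tok, 1) = "<" Then
            ' An opening tag waits on the stack for its closing tag.
            Top = Top + 1
            Stack(Top) = i
        End If
    Next i

    MatchTags = Partner
End Function

Sub DemoMatchTags()
    Dim P() As Long
    P = MatchTags(Array("<mrow>", "<mi>", "x", "</mi>", "</mrow>"))
    ' With the default Option Base 0: <mrow> pairs with index 4,
    ' <mi> pairs with index 3, and the text token "x" has no partner.
    Debug.Print P(0); P(1); P(2)
End Sub
```

    The 'innermost' set never has to be located explicitly: whatever is on top of the stack when a closing tag arrives is the innermost open tag at that point.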

    Now, about the algorithm. I would look for certain operators in the MathML code. For example, it seems division is represented by <mfrac>. The expression a/b would look like
     <mfrac>
       <mi>
         a
       </mi>
       <mi>
         b
       </mi>
     </mfrac>
    Within an <mfrac> tag there should be 2 pairs of tags on the same level: one contains the dividend, the other the divisor. So, when an <mfrac> tag is found in the array, take the tags on the next level and replace them with (). Between them, a / operator should be placed in the resulting expression. Like this:
     <mfrac>
       (
         a
       )
       /
       (
         b
       )
     </mfrac>
    Then any pairs of parentheses that don't have at least one operator between them can be removed.

    The same logic goes for raising to powers. I guess it will also work for multiplication, but your example doesn't show it. For clarification, please tell me what the following expression would look like in MathML:
    (a+1)*(b-1)
    Actually, without this info my whole "algorithm" so far is based on hope only.

    Jimmy

  3. #3
    Bob Phillips
    First thought: this looks tailor-made for recursive code.
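
    To make the recursive idea concrete, here is a rough sketch rather than a finished parser. Each call converts one element: it reads the opening tag, calls itself for every child until it reaches a closing tag, and then combines the children's text according to the tag. The names MathMLToText, ParseNode, JoinAll and Wrap are invented for this illustration. It assumes well-formed input with a single root element, no self-closing tags, and only the tags and entities that appear in this thread; &PlusMinus; is reduced to "+" to match the target expression in post #1.

```vba
Option Explicit

Private Tok() As String   ' tokens: tags and the text between them
Private Pos As Long       ' index of the next unread token

' Hypothetical entry point: splits the MathML into tokens, then parses it.
Public Function MathMLToText(ByVal Src As String) As String
    Dim Parts() As String, i As Long, n As Long
    Src = Replace(Replace(Replace(Src, vbCr, ""), vbLf, ""), vbTab, "")
    Src = Replace(Src, "<", Chr$(1) & "<")
    Src = Replace(Src, ">", ">" & Chr$(1))
    Parts = Split(Src, Chr$(1))
    ReDim Tok(0 To UBound(Parts) + 1)
    For i = 0 To UBound(Parts)
        If Len(Trim$(Parts(i))) > 0 Then    ' skip gaps between tags
            Tok(n) = Trim$(Parts(i))
            n = n + 1
        End If
    Next i
    If n = 0 Then Exit Function
    ReDim Preserve Tok(0 To n)
    Tok(n) = "</>"            ' sentinel: stops the child loop on malformed input
    Pos = 0
    MathMLToText = ParseNode()
End Function

' Converts the element (or text) at Tok(Pos) and moves Pos past it.
Private Function ParseNode() As String
    Dim Tag As String, Elem As String, Kids As Collection
    Tag = Tok(Pos)
    Pos = Pos + 1
    If Left$(Tag, 1) <> "<" Then            ' text: a name, number or operator
        Select Case Tag
            Case "&minus;": ParseNode = "-"
            Case "&PlusMinus;": ParseNode = "+"   ' simplification: "+" root only
            Case "&InvisibleTimes;": ParseNode = "*"
            Case Else: ParseNode = Tag
        End Select
        Exit Function
    End If
    Set Kids = New Collection
    Do While Left$(Tok(Pos), 2) <> "</"     ' recurse into each child
        Kids.Add ParseNode()
    Loop
    If Pos < UBound(Tok) Then Pos = Pos + 1 ' skip the closing tag, not the sentinel
    Elem = Mid$(Tag, 2, Len(Tag) - 2)       ' tag name without < > and attributes
    If InStr(Elem, " ") > 0 Then Elem = Left$(Elem, InStr(Elem, " ") - 1)
    Select Case Elem
        Case "mfrac": ParseNode = Wrap(Kids(1)) & "/" & Wrap(Kids(2))
        Case "msup": ParseNode = Wrap(Kids(1)) & "^" & Wrap(Kids(2))
        Case "msqrt": ParseNode = Wrap(JoinAll(Kids)) & "^(1/2)"
        Case "mrow": ParseNode = Wrap(JoinAll(Kids))
        Case Else: ParseNode = JoinAll(Kids)    ' mi, mn, mo, math
    End Select
End Function

Private Function JoinAll(Kids As Collection) As String
    Dim v As Variant
    For Each v In Kids
        JoinAll = JoinAll & v
    Next v
End Function

' Adds parentheses unless S is a single name/number or already one group.
Private Function Wrap(ByVal S As String) As String
    Dim i As Long, Depth As Long, Grouped As Boolean
    If Not S Like "*[!A-Za-z0-9.]*" Then Wrap = S: Exit Function
    Grouped = (Left$(S, 1) = "(")
    For i = 1 To Len(S) - 1                 ' does the first "(" close at the end?
        If Mid$(S, i, 1) = "(" Then Depth = Depth + 1
        If Mid$(S, i, 1) = ")" Then Depth = Depth - 1
        If Depth = 0 Then Grouped = False
    Next i
    If Grouped Then Wrap = S Else Wrap = "(" & S & ")"
End Function

Sub DemoMathMLToText()
    ' The (a+1)*(b-1) example from this thread, wrapped in one <math> root.
    Debug.Print MathMLToText("<math><mrow><mi>a</mi><mo>+</mo><mn>1</mn></mrow>" & _
        "<mo>*</mo><mrow><mi>b</mi><mo>-</mo><mn>1</mn></mrow></math>")
End Sub
```

    Because the fraction, power and root rules each see their children as finished strings, nesting to any depth needs no extra bookkeeping. One known gap in this sketch: a fraction that sits directly inside a row is not parenthesized as a whole, so a row mixing it with an explicit "/" operator could be misread.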

  4. #4
    Hi Jimmy
    (a+1)*(b-1)
    I believe that the above would look something like:

    <mrow>
      <mi>a</mi>
      <mo>+</mo>
      <mn>1</mn>
    </mrow>
    <mo>*</mo>
    <mrow>
      <mi>b</mi>
      <mo>-</mo>
      <mn>1</mn>
    </mrow>
    where:

    mrow—displays its subelements in a horizontal row.
    mi—represents an identifier such as the name of a function or variable.
    mo—represents an operator or delimiter.
    mn—represents a number.

  5. #5
    macropod
    Hi Saladsamurai,

    Based on what you posted, here's a VBA function for parsing your MathML code:
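    Function GetFormula(ByVal InputStream As String) As String
    ' Converts a block of MathML into a calculator-style expression.
    ' ByVal keeps the caller's string unchanged.
    Dim i As Long, ArStr As Variant, StrExp As String
    ' Strip whitespace so every tag and every piece of text becomes one token.
    ' (This also turns <mo form="prefix"> into <moform="prefix">, which is
    ' discarded by the Case Else branch like any other unhandled tag.)
    InputStream = Replace(InputStream, " ", "")
    InputStream = Replace(InputStream, vbCr, "")
    InputStream = Replace(InputStream, vbLf, "")
    InputStream = Replace(InputStream, vbTab, "")
    ' Put a Chr(1) delimiter around each tag, collapse doubled delimiters, then split.
    InputStream = Replace(InputStream, "<", Chr(1) & "<")
    InputStream = Replace(InputStream, ">", ">" & Chr(1))
    InputStream = Replace(InputStream, ">" & Chr(1) & Chr(1) & "<", ">" & Chr(1) & "<")
    ArStr = Split(InputStream, Chr(1))
    ' Assumes the input starts and ends with a tag, so the first and last
    ' elements are the empty strings outside the outermost tags.
    For i = 1 To UBound(ArStr) - 1
      StrExp = ArStr(i)
      Select Case StrExp
        Case "</mrow>"
          ' Two adjacent rows are taken to be the numerator and denominator of an <mfrac>.
          If ArStr(i + 1) = "<mrow>" Then
            ArStr(i) = ")/"
          Else
            ArStr(i) = ")"
          End If
        Case "<mrow>": ArStr(i) = "("
        Case "&PlusMinus;": ArStr(i) = "+" ' keeps only the "+" root
        Case "&minus;": ArStr(i) = "-"
        Case "&InvisibleTimes;": ArStr(i) = "*"
        Case "</msup>"
          ' Assumes the exponent is a single token, two places before </msup>.
          ArStr(i) = ""
          ArStr(i - 2) = "^" & ArStr(i - 2)
        Case "<msqrt>": ArStr(i) = "("
        Case "</msqrt>": ArStr(i) = ")^(1/2)"
        Case Else
          ' Drop every other tag; plain text (names, numbers, operators) is kept.
          If Left(StrExp, 1) = "<" Then ArStr(i) = ""
      End Select
    Next
    GetFormula = Join(ArStr, "")
    End Function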
    If you put the function and the MathML code into a Word document, then select the block of MathML code you want to process, you could extract the formula with:
    Sub ParseMathML()
    MsgBox GetFormula(Selection.Text)
    End Sub
    Last edited by macropod; 08-11-2009 at 07:03 PM.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  6. #6
    Hey there macropod

    Sorry, I am a little confused as to how I use this. Do I really use a Word document?

  7. #7
    macropod
    Hi Saladsamurai,

    The function should work in any app that supports VBA.

    You need a process for capturing the MathML string to be parsed, which I assume you have.

    Rather than writing code for the data capture, I opted for the quick & dirty approach of opening the MathML file or pasting the block of MathML code into a Word document, selecting the range to process, then running the 'ParseMathML' macro - which calls the function and passes the selected string to it.
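
    For anyone who would rather skip Word, the capture step can be a plain file read. The following is a minimal sketch, not tested against the original data: ReadTextFile is a made-up helper name, the file path is a placeholder, and it assumes the file uses a single-byte encoding (which is enough when the MathML uses named entities such as &minus;).

```vba
Option Explicit

' Hypothetical helper: returns the whole contents of a text file as a string.
' Assumes a single-byte (ANSI/ASCII) encoded file.
Function ReadTextFile(ByVal FilePath As String) As String
    Dim FileNum As Integer, Buffer As String
    FileNum = FreeFile
    Open FilePath For Binary Access Read As #FileNum
    Buffer = Space$(LOF(FileNum))   ' size the buffer to the file length
    Get #FileNum, , Buffer          ' read the whole file in one call
    Close #FileNum
    ReadTextFile = Buffer
End Function

Sub ParseMathMLFile()
    Dim Raw As String
    ' "C:\Temp\formula.mml" is a placeholder path, not one from this thread.
    Raw = ReadTextFile("C:\Temp\formula.mml")
    ' Files usually end lines with vbCrLf, so strip the line feeds before
    ' handing the text to GetFormula (the function posted earlier in this thread).
    MsgBox GetFormula(Replace(Raw, vbLf, ""))
End Sub
```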
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]
