Closed Bug 722788 Opened 12 years ago Closed 11 years ago

JSON number parsing is slower than it needs to be

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: jrmuizel, Assigned: magicxinzhang)

Details

(Whiteboard: [good first bug][mentor=Waldo][lang=c++])

Attachments

(2 files, 2 obsolete files)

Currently we always use doubles when parsing numbers. V8 parses directly to integers using integers when possible.
The code to change is GetPrefixInteger in jsnum.cpp.  Someone who knows C++ already and is looking for a relatively simple way to start working on the JS engine or Gecko would probably be able to do this without too much trouble.
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [good first bug][mentor=Waldo][lang=c++]
Right now that method is fully base-generic.  It may be worth having a faster method for base-10 only.  That issue is conceptually orthogonal, but it might reduce complexity of implementation to address it here.
You might be able to steal code from nsCSSScanner::ParseNumber, which already has this optimization.  http://mxr.mozilla.org/mozilla-central/source/layout/style/nsCSSScanner.cpp#1136
Wonder if we could use one implementation for both, then.
That'd be cool, but I suspect the syntax of CSS numbers is just different enough from the syntax of JS numbers to make it impossible.
I'll attempt this ticket if someone could assign it to me :)
Assignee: general → david.c.seifried
You might consider working in decimal throughout, or in any case looking at ASCII -> decimal parsing code.  See: http://speleotrove.com/decimal/ for background, open source code, etc.
(In reply to David Seifried (:dseif) from comment #6)
> I'll attempt this ticket if someone could assign it to me :)

Excellent!

I should perhaps first note that I received private mail from someone else interested in working on this.  I've directed him here, and hopefully he'll be able to leave a comment soon.  Regardless which person works on this (or even both -- the learning experience is valuable even if only one patch ends up being used), there's more than enough starter bug material to go around for everyone,

Is there any other info I can provide beyond that in comment 1 that would be helpful?
Also recall that we hang out in #jsapi on irc.mozilla.org ! Jeff's nick is Waldo
Hi, I'm the one sending private mails. I've taken a brief look at this and I'll pursue it further regardless of whether my patch will be used or not. I see this as an opportunity to learn.
TokenStream::getTokenInternal() in js/src/frontend/TokenStream.cpp might also be instructive -- it's a big function, search for "Look for a decimal number".
I created js_jstrtod_harder which works on a jschar* range [start,end)
and uses an patched _strtod function called _jstrtod that also works on
an jschar* range. This eliminates the need to create c-strings out of jschar*
intervals to fit into _strtod. As a result ComputeAccurateDecimalInteger is
faster. I tried some integer arithmetic fo the GetPrefixInteger fast-path but I
didn't get it to run faster than the current floating point arithmetic, my cpu
is Intel Core 2 Duo. I also added GetPrefixIntegerDecimalFast which I call from
JSONParser::readNumber's fast-path instead of GetPrefixInteger since
GetPrefixIntegerDecimalFast assumes [start,end) to be a range of decimals.
GetPrefixInteger seems faster in those cases where ComputeAccurateDecimalInteger
is called but more or less the same speed otherwise.
So after looking at Andreas patch, I was way off on what I was doing.  So whoever can assign it to Andreas instead of me that would be cool :)

Also you may want to flag it for review Andreas so someone takes a look at it.
Ahh, I forgot to flag it :) Anyway, there's at least "one" bug in the patch, the _jsstrtod se pointer should be jschar* and not char*, same goes for js_strtod_harder. I'll submit another patch tomorrow and this time I'll remember to flag it.
Assignee: david.c.seifried → andreassonurgudmunds
Hi is this bug still open? I'm new here and I would like to take a shot.

And while I have your attention, any pointers to where to start. I see there are attempts to fix this before so I should probably start by looking at those files?

Thanks,
Assignee: andreassonurgudmunds → general
Attached patch WIP Use integer to parse numbers (obsolete) — Splinter Review
Am I in the right direction? How do I test the performance?
Attachment #737063 - Flags: feedback?(jwalden+bmo)
Comment on attachment 737063 [details] [diff] [review]
WIP Use integer to parse numbers

Review of attachment 737063 [details] [diff] [review]:
-----------------------------------------------------------------

It's been a looong time since I looked at this code, but I think this might have been the vague track I was thinking of when I proposed it.

That said, looking again, I think probably changing GetPrefixInteger is not quite the right thing to do.  In the JSON case, and in other cases (like number parsing in general), we know extra things that GetPrefixInteger doesn't.  GetPrefixInteger has to handle parseInt's bizarre behavior of returning whatever number is encoded in the "prefix" of the provided characters.  JSON and number parsing in the language frontend don't have that issue -- they know [begin, end) contain *only* valid number characters of the given base.  This patch handles the known-base case, but it doesn't do anything about the rest of it.

I think what we want to do here is not change GetPrefixInteger but rather introduce new methods that depend on this information.  Being able to depend on only-number characters is a big help here -- that means you can handle the overflow case by using the fast path only if fewer characters than are in "18446744073709551616" are present.  (We don't care about big-number parsing speed, so we can just foist it onto a slow path.  :-) )

I suggest adding a new method to jsnum.cpp and jsnum.h:

extern double js::ParseDecimalNumber(const JS::TwoByteChars chars);

Then call that from JSONParser::readNumber() in jsonparser.cpp.  You can create a TwoByteChars like so:

  TwoByteChars chars(digitStart.get(), current - digitStart);

As far as perf-testing goes: perf-testing is something of a black art.  Compile a JS shell with --enable-optimize --disable-debug, then test performance with scripts that do something like this:

  var start = Date.now();
  for (var i = 0; i < 1e7; i++)
    JSON.parse("23892398")
  var end = Date.now();
  print("Elapsed: " + (end - start) + "ms");

Note that you want the time-measurement inside the shell, not from some external program like |time|, so you're measuring the narrowest bit of behavior possible.

Then make changes and see how the performance numbers vary.  When you run tests, you ideally want to do it on as un-loaded a system as possible.  That means the fewest possible programs running.  Also if you're on a laptop make sure to be plugged in -- laptop CPUs often slow themselves down when not plugged in, and the timing numbers you'll get will be very noisy.

Let me know if you need anything more here!  And sorry for taking a bit long to get back to you.

::: js/src/jsnum.cpp
@@ +201,5 @@
> +            else
> +                break;
> +            dec = dec * base + digit;
> +        }
> +        d = static_cast<double>(dec);

The big issue to watch out for here is accuracy differences.  There are no limits on the values that can be represented in JSON literals.  Arithmetic on double will naturally saturate to the correct values in the original loop here.  Arithmetic on uint64_t poses two concerns: overflowing uint64_t space, and computing the wrong value.

The first case, overflowing uint64_t, should be triggered by a testcase like this one (beware: ECMA doesn't require super-big numbers be printed fully accurately!):

js> assertEq(JSON.parse('18446744073709551616'), Math.pow(2, 64));

I think your patch would compute a small number, not a big one.  The same issue manifests itself a bit differently when you get overflow of not just uint64_t, but integer-representable double precision:

js> var s = '9'; for (var i = 0; i < 9; i++) s += s; assertEq(JSON.parse(s), Infinity);
Infinity

The second case, inaccuracy, would trigger when uint64_t can represent a value, but double can't.  C++ says the compiler can round up or down as an implementation-defined choice.  I believe this is controlled by FPU rounding mode these days; JS semantics assume/require that to be round-to-nearest-even.  I think we can keep relying on that here, so this doesn't pose a concern.  Still, it would be good to have some extra tests for this, something like:

  var head = "1844674407370955";
  for (var i = 1616; i < 3665; i++)
  {
    assertEq(JSON.parse(head + i.toString()),
             Math.pow(2, 64));
  }
  for (var i = 3665; i < 7760; i++)
  {
    assertEq(JSON.parse(head + i.toString()),
             Math.pow(2, 64) + 4096);
  }
  assertEq(JSON.parse(head + "7760"),
           Math.pow(2, 64) + 8192);
Attachment #737063 - Flags: feedback?(jwalden+bmo)
Attached patch WIP Use integer to parse numbers (obsolete) — Splinter Review
>var start = Date.now();
>  for (var i = 0; i < 1e7; i++)
>    JSON.parse("23892398")
>  var end = Date.now();
>  print("Elapsed: " + (end - start) + "ms");
Results running above test:
with patch Elapsed: around 616ms with several tries
without patch Elapsed: around 665ms with several tries

I don't know much about performance measures. All I can tell is that the patch improves performance. But is it considered a good improvement in terms of performance? In other words, is it worth the effort to make such code change?

The new method will be called when length of chars is less than 20 or chars is less than 18000000000000000000 (Is it good enough? Checking "18446744073709551616" is kinda tedious.).
Attachment #737063 - Attachment is obsolete: true
Attachment #741673 - Flags: feedback?(jwalden+bmo)
Comment on attachment 741673 [details] [diff] [review]
WIP Use integer to parse numbers

Review of attachment 741673 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry (again :-\ ) for taking so long to get back.

This is along the right lines, but it still has issues, partly because I led you astray in previous comments.  :-(  Definitely getting there, tho!  If you want to be slightly ambitious, you might consider looking at the normal JS parser and seeing if the same trick can be easily inserted there.  I don't remember whether our code there makes the decimal case stand out so much, so that we could easily do the same optimization there.

::: js/src/jsnum.cpp
@@ +190,5 @@
> +    const jschar *s = chars.start().get();
> +    for (; s < chars.end().get(); s++) {
> +        jschar c = *s;
> +        int digit = c - '0';
> +        dec = dec * 10 + digit;

As a general rule, it's always good to assert invariants, when you can do it quickly.  Here, we can assert that the number being parsed will always be exactly precise as a double, and will never overflow.

uint64_t next = dec * 10 + digit;
MOZ_ASSERT(next < DOUBLE_INTEGRAL_PRECISION_LIMIT, "next value won't be an integrally-precise double");
dec = next;

(In most cases we'd also want to assert that we didn't overflow uint64_t precision, but the double-precision limit is sufficiently lower that it will check both cases.)

::: js/src/jsnum.h
@@ +108,5 @@
>   */
>  const double DOUBLE_INTEGRAL_PRECISION_LIMIT = uint64_t(1) << 53;
>  
> +extern double
> +ParseDecimalNumber(const JS::TwoByteChars chars);

This definitely could use a comment by it, explaining what it does and when it can and can't be safely called.  That it's limited only to decimal numbers that won't exceed double integral precision is not at all obvious, unless you look to the implementation -- which shouldn't be necessary.  Something like this is what you want:

"""
Parse a decimal number encoded in |chars|.  The decimal number must be sufficiently small that it will not overflow the integrally-precise range of the double type -- that is, the number will be smaller than DOUBLE_INTEGRAL_PRECISION_LIMIT.
"""

::: js/src/jsonparser.cpp
@@ +222,5 @@
>          const jschar *dummy;
>          double d;
> +        TwoByteChars chars(digitStart.get(), current - digitStart);
> +        if (chars.length() < 20 || (chars.length() == 20 && chars[0] == '1' 
> +                                                         && chars[1] < '8')) {

I didn't do a good job of talking about the size at which integral precision gets lost, for arithmetic on doubles, as compared to on uint64_t.  :-(  Every integer can be represented up to and including 2**53 -- the boundary I mentioned earlier was wrong.  But 2**53 - 1 can't be represented exactly.  For a bunch more detail on this, look at http://en.wikipedia.org/wiki/IEEE_floating_point and http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html after that, if you're interested.  It's fascinating stuff, especially the latter.

As 2**53 is 9007199254740992, twenty characters is the wrong boundary.  So I'd suggest something like this for the comparison:

if (chars.length() < strlen("9007199254740992")) {
    // If the decimal number's shorter than the length of 2**53 (the largest
    // number a double can represent with integral precision), parse it using
    // a decimal-only parser.  This comparison is conservative but faster
    // than a fully-precise check.
    d = ParseDecimalNumber(chars);
    return numberToken(negative ? -d : d);
}

(Note that the compilers we especially care about will evaluate the strlen call at compile time and insert a constant, in optimized builds, so there's no strlen overhead here.)
Attachment #741673 - Flags: feedback?(jwalden+bmo)
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #19)
> Every integer can be represented up to and including 2**53 -- the boundary I
> mentioned earlier was wrong.  But 2**53 - 1 can't be represented exactly. 

Sorry, that should be 2**53 + 1, that can't be represented exactly.  To see this in JS, observe:

js> Math.pow(2, 53)
9007199254740992
js> Math.pow(2, 53) - 1
9007199254740991
js> Math.pow(2, 53) + 1
9007199254740992
js> Math.pow(2, 53) + 2
9007199254740994
Comment on attachment 741673 [details] [diff] [review]
WIP Use integer to parse numbers

Review of attachment 741673 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsnum.cpp
@@ +188,5 @@
> +    size_t len = chars.length();
> +    uint64_t dec = 0;
> +    const jschar *s = chars.start().get();
> +    for (; s < chars.end().get(); s++) {
> +        jschar c = *s;

Nice trick that you could use here:

RangedPtr<jschar> s = chars.start();
RangedPtr<jschar> end = chars.end();
for (; s < end; s++) {
    jschar c = *s;

This will assert that you don't read outside the string.

@@ +189,5 @@
> +    uint64_t dec = 0;
> +    const jschar *s = chars.start().get();
> +    for (; s < chars.end().get(); s++) {
> +        jschar c = *s;
> +        int digit = c - '0';

You could also assert that '0' <= c && c <= '9'.
Xin, are you still able to do that last tiny bit of pushing on this patch in the near future?  People working on the Telemetry component in the browser are expressing interest in JSON parsing perf, and it'd be nice to finish this up and see what more it buys us.  If not, I'm happy to push this the last hundred yards for you.
Flags: needinfo?(magicxinzhang)
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #22)
> Xin, are you still able to do that last tiny bit of pushing on this patch in
> the near future?  People working on the Telemetry component in the browser
> are expressing interest in JSON parsing perf, and it'd be nice to finish
> this up and see what more it buys us.  If not, I'm happy to push this the
> last hundred yards for you.

Sorry for the delay. I will finish it up this weekend.
Flags: needinfo?(magicxinzhang)
Made corresponding modifications in Comment 19.
Waldo, you mentioned possibility inserting similar code to JS parser in Comment 19.  I am pretty new to JS Engine, so for now I guess I will just leave it, but I would like to learn more about JS Engine, can you point me to some docs that can help me with that? The more, the better.  Thx.
Attachment #741673 - Attachment is obsolete: true
Attachment #754247 - Flags: review?(jwalden+bmo)
Comment on attachment 754247 [details] [diff] [review]
Use integer to parse numbers

Review of attachment 754247 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, modulo slight style nits and a little rewrapping to 79ch.  I'll fix the issues when I push this, hopefully later today after I get another review from someone else to push a second patch at the same time.  Thanks!

::: js/src/jsnum.cpp
@@ +176,5 @@
>  
> +double
> +js::ParseDecimalNumber(const JS::TwoByteChars chars)
> +{
> +    uint64_t dec = 0;

It occurs to me this method could/should assert that there's at least one number here, so I'm adding such an assert.

@@ +179,5 @@
> +{
> +    uint64_t dec = 0;
> +    RangedPtr<jschar> s = chars.start();
> +    RangedPtr<jschar> end = chars.end();
> +    for (; s < end; s++) {

This is a super-nitpick, and it may or may not make any difference.  It's not clear to me whether compilers could recognize this or not.  But given the at-least-one-number assumption, this loop will always iterate at least once.  Therefore, having the loop condition checked on the first iteration is wasted effort.  We can easily solve this with a do-while loop instead, and then we don't have to wonder whether compilers will be smart here or not.

@@ +182,5 @@
> +    RangedPtr<jschar> end = chars.end();
> +    for (; s < end; s++) {
> +        jschar c = *s;
> +        JS_ASSERT('0' <= c && c <= '9');
> +        int digit = c - '0';

Given this will always be between 0-9, I slightly prefer using a smaller type to make limits a little clearer.

@@ +184,5 @@
> +        jschar c = *s;
> +        JS_ASSERT('0' <= c && c <= '9');
> +        int digit = c - '0';
> +        uint64_t next = dec * 10 + digit;
> +        MOZ_ASSERT(next < DOUBLE_INTEGRAL_PRECISION_LIMIT, 

Trailing whitespace here.

::: js/src/jsnum.h
@@ +107,5 @@
>   * may be precisely represented using the IEEE-754 double-precision format.
>   */
>  const double DOUBLE_INTEGRAL_PRECISION_LIMIT = uint64_t(1) << 53;
>  
> +/* Parse a decimal number encoded in |chars|.  The decimal number must be

SpiderMonkey commenting style for multiline comments is like this:

/*
 * line 1
 * line 2
 * ...
 */

So the first line should be "/*", then the comment text should start on the next line.

::: js/src/jsonparser.cpp
@@ +225,5 @@
>          const jschar *dummy;
>          double d;
> +        TwoByteChars chars(digitStart.get(), current - digitStart);
> +        if (chars.length() < strlen("9007199254740992")) {
> +            // If the decimal number is shorter than the length of 2**53 + 1 

2**53, to be hyper-precise.  :-)

@@ +227,5 @@
> +        TwoByteChars chars(digitStart.get(), current - digitStart);
> +        if (chars.length() < strlen("9007199254740992")) {
> +            // If the decimal number is shorter than the length of 2**53 + 1 
> +            // (the largest number a double can represent with integral 
> +            // precision), parse it using a decimal-only parser.  This 

More trailing whitespace.

@@ +229,5 @@
> +            // If the decimal number is shorter than the length of 2**53 + 1 
> +            // (the largest number a double can represent with integral 
> +            // precision), parse it using a decimal-only parser.  This 
> +            // comparison is conservative but faster than a fully-precise check.
> +            d = ParseDecimalNumber(chars);

Nitpick, but I slightly prefer having two |double d| here, because the if-block's use is entirely distinct from the else-path's use.  Shouldn't matter to the compiler -- it'll detect the lack of dependencies between the two -- but it helps out the fast reader very slightly.  The |dummy| declaration should go after the if-decimal block, too, as it's only used in the non-decimal case.
Attachment #754247 - Flags: review?(jwalden+bmo) → review+
(In reply to Xin from comment #24)
> Waldo, you mentioned possibility inserting similar code to JS parser in
> Comment 19.  I am pretty new to JS Engine, so for now I guess I will just
> leave it, but I would like to learn more about JS Engine, can you point me
> to some docs that can help me with that? The more, the better.  Thx.

There's not too much to document here, I think -- at least, not for people familiar with compiler terminology.  Or so I think; it's possible we nonetheless could use more documentation here.

Everything that converts character sequences into executable JS bytecode is known as the "frontend", in the imaginatively-named js/src/frontend directory.  The part of frontend that parses characters into "tokens" (that the parser consumes to produce a parse tree) is the tokenizer, in TokenStream.cpp.  TokenStream::getTokenInternal handles this, and if you look through a bit you'll find the number-tokenizing code that would be amenable to change.

I'll probably file a followup bug when I land this, to apply the same change to the main parser.  If you feel like doing this, and have further questions, make 'em there -- I'll post the bug number here when I file.
https://hg.mozilla.org/integration/mozilla-inbound/rev/eca7d80bb978

Thanks again for the patch!
Assignee: general → magicxinzhang
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla24
https://hg.mozilla.org/mozilla-central/rev/eca7d80bb978
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: