Add indices #345
toolz/itertoolz.py
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
"""

return(itertools.product(*[range(_) for _ in sizes]))
No need to add parens around the return:
return itertools.product(*map(range, sizes))
In general, I would avoid using _ like this. Most people use that name when they want to ignore a value but must still assign it, like:
for _ in range(num_retries):
    # code that wants to run `num_retries` times but doesn't
    # need to know the count
    ...
or
# I just want the first and last, but don't care about the
# middle of an iterator.
first, *_, last = sequence
Part of the reason for this convention is that many Python shells (the default one, IPython, bpython, etc.) use _ to mean the result of the last evaluated expression. For example:
In [16]: 2 + 2
Out[16]: 4
In [17]: _
Out[17]: 4
This makes it hard to use a REPL to reason about code that uses the name _, because it will get trampled and reassigned a lot.
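For reference, a minimal sketch of the two spellings side by side. The `indices(*sizes)` signature is an assumption based on the snippet above (the full function definition is not shown in this thread), and `indices_original` is a hypothetical name used only for comparison:

```python
import itertools


def indices(*sizes):
    """Suggested spelling: no extra parens, no throwaway name `_`."""
    # Assumption: the function takes the axis sizes as positional arguments.
    return itertools.product(*map(range, sizes))


def indices_original(*sizes):
    """Spelling from the diff, with `_` renamed to a descriptive name."""
    return itertools.product(*[range(size) for size in sizes])


print(list(indices(3, 2)))
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Both versions hand `itertools.product` one `range` per axis, so they yield the same tuples in the same order; the `map` form just avoids building an intermediate list.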
Yep, sorry, this was kind of sloppy on my part. Will clean this up. Thanks for the tips.
Do you have an example of where you would use this? I understand how this works, but I am not sure when to apply it. Maybe provide a functional example in the docstring with an array?
f52c034
to
eb618f3
Compare
l[1][0] = 3
l[1][1] = 4
l[2][0] = 5
l[2][1] = 6
Not sure what you had in mind for an example, but does this help? If not, do you have some other ideas of what you might like to see?
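As a rough illustration of the kind of docstring example under discussion (a sketch only: the loop below is an assumption, since only the last four assignments are visible in the diff, and `indices(*sizes)` is an assumed signature), filling a 3 x 2 nested list could look like:

```python
import itertools


def indices(*sizes):
    # Assumed signature; yields index tuples with the last axis varying fastest.
    return itertools.product(*map(range, sizes))


# A 3 x 2 nested list to fill in.
l = [[None] * 2 for _ in range(3)]

# Walk every (i, j) position and assign a running counter starting at 1.
for value, (i, j) in enumerate(indices(3, 2), start=1):
    l[i][j] = value

print(l)
# [[1, 2], [3, 4], [5, 6]]
```

This reproduces the assignments shown above (`l[1][0] = 3` through `l[2][1] = 6`) without writing each one out by hand.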
I guess I am not used to using index access inside for loops. Normally people just loop over the values directly, and in numpy you don't want to be doing a bunch of scalar accesses like this. To help me understand, can you describe some real code that you have written that uses this?
I'll try. 😄
So in some cases I have binary data that I need to split up into smaller blocks to work on in separate processes, potentially combining results at different stages. This data normally is on disk and may be a single file or split across multiple files. In these cases, I need an index for each block that I will work with. While I suppose one could compute a single index for each block, it makes the code much harder to reason about, and it is already somewhat complex code (e.g. adds halos to data blocks, slices out halos afterwards, etc.). Being able to have indices like this makes it easier to reason about these cases and handle arbitrary dimensions. Not to mention stitching the pieces together becomes much more straightforward.
Hopefully that makes sense.
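A minimal sketch of that use case, not the actual code described above: `block_slices` and its parameters are hypothetical names, the `indices(*sizes)` signature is assumed, and halos are left out to keep it short.

```python
import itertools


def indices(*sizes):
    # Assumed signature; yields index tuples with the last axis varying fastest.
    return itertools.product(*map(range, sizes))


def block_slices(shape, block_shape):
    """Yield (block_index, slices) for each block that tiles `shape`."""
    # Number of blocks along each axis, rounding up so a ragged
    # final block is still covered.
    counts = [-(-n // b) for n, b in zip(shape, block_shape)]
    for block_index in indices(*counts):
        slices = tuple(
            slice(i * b, min((i + 1) * b, n))
            for i, b, n in zip(block_index, block_shape, shape)
        )
        yield block_index, slices


# Tile a 5 x 4 dataset with 2 x 2 blocks: 3 blocks along axis 0, 2 along axis 1.
blocks = dict(block_slices((5, 4), (2, 2)))
print(blocks[(2, 1)])
# (slice(4, 5, None), slice(2, 4, None))
```

Each block keeps its N-dimensional index, so a worker's result can be written back with the same tuple of slices, and nothing in the code depends on the number of dimensions.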
I think I understand, thanks for clarifying! Looking through some of my numpy code I see there are places where I could have used something like this; however, I realized that this is in numpy as numpy.indices. I wonder if I would want this when working with normal lists/tuples where numpy was not available. If we are going the route of allowing more functions into toolz but selectively curating the top-level namespace, then I would be +1 on adding this, but -0 on putting it in the top level. This is because I think it is not immediately obvious when this is the right function to use over just standard looping or slice indexing, so it is more "advanced" than other functions in toolz.
Sounds reasonable. I'm ok with not including it in the main namespace.
Yeah, numpy.indices is pretty different from this. Instead of doing something like this, it creates a massive array such that each index combination is specified. This ends up being pretty expensive for large arrays.
We can actually do much better if we note that much of this information is redundant and we are willing to part with having it in one big array. For most use cases, these are safe assumptions. Following them we get something like this. For decent-sized arrays, it is not unreasonable to see a speedup of an order of magnitude, or potentially a few orders of magnitude, by following this strategy.*
Even if we do need a full array with all combinations like numpy.indices, we can pack the result from the xnumpy function linked above into an array and still cut down the creation time to roughly half.*
* My benchmarking is still rather primitive at this point, but it does seem reliable thus far.
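To make the redundancy argument concrete, here is a rough back-of-the-envelope count rather than a benchmark. It relies on numpy.indices returning, by default, a dense array of shape (ndim, *shape), and it assumes the "open" alternative keeps just one 1-D range per axis. Both helper names are hypothetical, and `math.prod` needs Python 3.8 or later.

```python
import math


def dense_index_count(shape):
    """Elements stored by a dense grid of shape (ndim, *shape)."""
    return len(shape) * math.prod(shape)


def open_index_count(shape):
    """Elements stored when each axis keeps only its own 1-D range."""
    return sum(shape)


shape = (100, 200, 300)
print(dense_index_count(shape))  # 18000000
print(open_index_count(shape))   # 600
```

The dense form grows with the product of the axis lengths, while the per-axis form grows with their sum, which is where the savings for large arrays would come from. This says nothing about actual timings.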
I think that we should create a separate repository for these kinds of functions. If someone wants to set this up, it could live under the pytoolz GitHub org.
On Mon, Oct 24, 2016 at 4:57 PM, Joe Jevnik notifications@github.com
Adds a function for iterating over shapes. Can be handy when working with ndarrays or other such objects.