Xdiff cleanup part2 #2070

ezekielnewren · 2025-10-15T21:13:54Z

gitgitgadget-git · 2025-10-15T21:14:29Z

gitgitgadget-git · 2025-10-15T21:14:30Z

gitgitgadget-git · 2025-10-15T21:14:30Z

gitgitgadget-git · 2025-10-15T21:14:31Z

gitgitgadget-git · 2025-10-15T21:14:31Z

ezekielnewren · 2025-10-15T21:17:23Z

gitgitgadget-git · 2025-10-15T21:19:30Z

gitgitgadget-git · 2025-10-15T21:31:59Z

gitgitgadget-git · 2025-10-15T22:03:11Z

gitgitgadget-git · 2025-10-16T21:58:43Z

xdiff/xdiffi.c

 static int get_indent(xrecord_t *rec)
 {
 	long i;
 	int ret = 0;


On the Git mailing list, "Kristoffer Haugsbakk" wrote (reply to this):

On Wed, Oct 15, 2025, at 23:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> referring to bytes in memory, rather than unicode code points, use

s/unicode/Unicode/

> uint8_t instead of char.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>[snip]

gitgitgadget-git · 2025-10-16T21:58:45Z

gitgitgadget-git · 2025-10-16T22:32:54Z

gitgitgadget-git · 2025-10-16T22:32:55Z

gitgitgadget-git · 2025-10-20T20:49:05Z

gitgitgadget-git · 2025-10-20T21:17:01Z

gitgitgadget-git · 2025-10-20T22:48:04Z

gitgitgadget-git · 2025-10-20T23:40:51Z

xdiff/xdiffi.c

 *  Davide Libenzi <davidel@xmailserver.org>
 *
 */



On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> The ha field is serving two different purposes, which makes the code
> harder to read. At first glance it looks like many places assume
> there could never be hash collisions between lines of the two input
> files. In reality, line_hash is used together with xdl_recmatch() to
> ensure correct comparisons of lines, even when collisions occur.
>
> To make this clearer, the old ha field has been split:
>   * line_hash: The straightforward hash of a line, requiring no
>     additional context.
>   * minimal_perfect_hash: Not a new concept, but now a separate
>     field. It comes from the classifier's general-purpose hash table,
>     which assigns each line a unique and minimal hash across the two
>     files.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

I'm a bit surprised that nobody has commented on this patch. I thought
that someone would have criticized the length of the name
"minimal_perfect_hash" or asked me why I was splitting one field into
two.

I don't see any reason why this patch series shouldn't move forward.
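
The split described in the commit message can be illustrated with a small, self-contained sketch. This is not xdiff's classifier: every name here (`cls_record`, `cls_table`, `cls_hash`, `cls_classify`) is hypothetical, the hash is deliberately weak so that collisions are easy to provoke, and a linear scan stands in for the real hash table. It only shows the idea that a raw per-line hash is confirmed by comparing bytes, and that each distinct line then receives a dense, unique id.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLS_MAX_LINES 64

/* Hypothetical record mirroring the two fields named in the commit message. */
struct cls_record {
	const uint8_t *ptr;
	size_t size;
	uint64_t line_hash;          /* depends only on the bytes of the line */
	size_t minimal_perfect_hash; /* dense id, unique per distinct line */
};

/* Hypothetical table of the distinct lines seen so far. */
struct cls_table {
	struct cls_record seen[CLS_MAX_LINES];
	size_t nr;
};

/* Deliberately weak hash (sum of bytes) so that "ab" and "ba" collide. */
static uint64_t cls_hash(const uint8_t *p, size_t n)
{
	uint64_t h = 0;
	for (size_t i = 0; i < n; i++)
		h += p[i];
	return h;
}

/*
 * Fill in both hash fields of rec. Equal line_hash values are never
 * trusted on their own: the bytes are compared as well, so colliding
 * lines still receive different ids. Returns 0 on success, -1 if the
 * table is full.
 */
static int cls_classify(struct cls_table *t, struct cls_record *rec)
{
	rec->line_hash = cls_hash(rec->ptr, rec->size);
	for (size_t i = 0; i < t->nr; i++) {
		const struct cls_record *s = &t->seen[i];
		if (s->line_hash == rec->line_hash &&
		    s->size == rec->size &&
		    !memcmp(s->ptr, rec->ptr, rec->size)) {
			rec->minimal_perfect_hash = s->minimal_perfect_hash;
			return 0;
		}
	}
	if (t->nr == CLS_MAX_LINES)
		return -1;
	rec->minimal_perfect_hash = t->nr; /* next unused dense id */
	t->seen[t->nr++] = *rec;
	return 0;
}
```

Because the ids are dense (0, 1, 2, ...), they can be used directly as array indices, which is the property the later replies in this thread rely on.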

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> I'm a bit surprised that nobody has commented on this patch. I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

Sometimes there aren't enough round tuits to go around, and when
people have been too busy to review it, we see no comment, either
positive ones or negative ones.

> I don't see any reason why this patch series shouldn't move forward.

A patch series needs a positive reason to move forward;
unfortunately we cannot tell much from lack of negative comments.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Mon, Oct 20, 2025 at 05:29:25PM -0600, Ezekiel Newren wrote:
> On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > The ha field is serving two different purposes, which makes the code
> > harder to read. At first glance it looks like many places assume
> > there could never be hash collisions between lines of the two input
> > files. In reality, line_hash is used together with xdl_recmatch() to
> > ensure correct comparisons of lines, even when collisions occur.
> >
> > To make this clearer, the old ha field has been split:
> >   * line_hash: The straightforward hash of a line, requiring no
> >     additional context.
> >   * minimal_perfect_hash: Not a new concept, but now a separate
> >     field. It comes from the classifier's general-purpose hash table,
> >     which assigns each line a unique and minimal hash across the two
> >     files.
> >
> > Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>
> I'm a bit surprised that nobody has commented on this patch. I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

I actually appreciate the longer name. I'm not a fan of abbreviations
that are hard to understand myself. Sure, they are easier to type, but
in many cases they end up making the code way harder to understand if
you are not deeply familiar with it.

There's of course exceptions to this, but I don't really think that your
patch falls into them.

Patrick

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Ezekiel

On 21/10/2025 00:29, Ezekiel Newren wrote:
> On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> The ha field is serving two different purposes, which makes the code
>> harder to read. At first glance it looks like many places assume
>> there could never be hash collisions between lines of the two input
>> files. In reality, line_hash is used together with xdl_recmatch() to
>> ensure correct comparisons of lines, even when collisions occur.
>>
>> To make this clearer, the old ha field has been split:
>>    * line_hash: The straightforward hash of a line, requiring no
>>      additional context.
>>    * minimal_perfect_hash: Not a new concept, but now a separate
>>      field. It comes from the classifier's general-purpose hash table,
>>      which assigns each line a unique and minimal hash across the two
>>      files.
>>
>> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
>
> I'm a bit surprised that nobody has commented on this patch.

I've been off the list and I haven't caught up with this series yet.

> I thought
> that someone would have criticized the length of the name
> "minimal_perfect_hash" or asked me why I was splitting one field into
> two.

I think "perfect_hash" would be fine if we want a shorter name. More
importantly it would be helpful to explain why the two fields have
different types. I assume it is because the perfect_hash is used as an
array index and therefore size_t is a better match for rust's usize than
uint64_t. How much more memory do we end up using by adding second hash
member to the struct? If the aim is to show that only one of them is
used at a time then a union might be more appropriate but I doubt that
plays well with rust.

I'll try and have a look at the other patches later this week. I think
the type changes are going to need careful review.

Thanks

Phillip

On the Git mailing list, Chris Torek wrote (reply to this):

On Tue, Oct 21, 2025 at 3:04 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
...
> uint64_t. How much more memory do we end up using by adding second hash
> member to the struct?

As in any string-to-string algorithm of this sort, there's one per
"symbol", but in this case a "symbol" is a line in a file. So
if files are M and N lines long, there are M+N symbols. Take
the difference of the size of the two records and multiply by this.

Assuming "sane" input file sizes (under a million lines each) it's
a few megabytes maximum...

Chris
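
The estimate above is plain arithmetic: (M + N) records times the per-record growth. A minimal sketch, assuming a hypothetical helper name (`extra_record_bytes`) and caller-supplied record sizes, since the thread does not state which two sizes are being compared at this point:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Extra bytes needed when every per-line record grows from old_size to
 * new_size. lines_a and lines_b are the line counts of the two files
 * (M and N in the message above). Returns 0 if the record did not grow.
 */
static uint64_t extra_record_bytes(uint64_t lines_a, uint64_t lines_b,
				   size_t old_size, size_t new_size)
{
	if (new_size <= old_size)
		return 0;
	return (lines_a + lines_b) * (uint64_t)(new_size - old_size);
}
```

For instance, two files of one million lines each and a record that grows by one 8-byte field (say from 24 to 32 bytes, figures chosen purely for illustration) give 2,000,000 * 8 = 16,000,000 bytes, which matches the "few megabytes" order of magnitude.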

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Tue, Oct 21, 2025 at 4:03 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ezekiel
>
> On 21/10/2025 00:29, Ezekiel Newren wrote:
> > On Wed, Oct 15, 2025 at 3:18 PM Ezekiel Newren via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Ezekiel Newren <ezekielnewren@gmail.com>
> >>
> >> The ha field is serving two different purposes, which makes the code
> >> harder to read. At first glance it looks like many places assume
> >> there could never be hash collisions between lines of the two input
> >> files. In reality, line_hash is used together with xdl_recmatch() to
> >> ensure correct comparisons of lines, even when collisions occur.
> >>
> >> To make this clearer, the old ha field has been split:
> >>    * line_hash: The straightforward hash of a line, requiring no
> >>      additional context.
> >>    * minimal_perfect_hash: Not a new concept, but now a separate
> >>      field. It comes from the classifier's general-purpose hash table,
> >>      which assigns each line a unique and minimal hash across the two
> >>      files.
> >>
> >> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > I'm a bit surprised that nobody has commented on this patch.
>
> I've been off the list and I haven't caught up with this series yet.
>
> > I thought
> > that someone would have criticized the length of the name
> > "minimal_perfect_hash" or asked me why I was splitting one field into
> > two.
>
> I think "perfect_hash" would be fine if we want a shorter name. More
> importantly it would be helpful to explain why the two fields have
> different types. I assume it is because the perfect_hash is used as an
> array index and therefore size_t is a better match for rust's usize than
> uint64_t.

Your understanding is correct. line_hash is fixed width while
minimal_perfect_hash is meant to be used as an array index into
memory. I'll update my commit message to make this more clear.

> How much more memory do we end up using by adding second hash
> member to the struct? If the aim is to show that only one of them is
> used at a time then a union might be more appropriate but I doubt that
> plays well with rust.

xrecord_t used to be defined with a pointer, so we're at the same
size. But more importantly I plan on splitting minimal_perfect_hash
out of xrecord_t into its own array. I think the diff algorithms end
up being a little bit faster with a separate array because each
element is only 8 bytes instead of 32.

In v2.51.0:
typedef struct s_xrecord {
	struct s_xrecord *next;
	char const *ptr;
	long size;
	unsigned long ha;
} xrecord_t;

This patch series:
typedef struct s_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
} xrecord_t;

> I'll try and have a look at the other patches later this week. I think
> the type changes are going to need careful review.

I appreciate the careful review. I figured it would be best to limit
the scope of this patch series to type changes, so that it wasn't
bogged down by other stuff.
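
The "same size" claim can be checked with `sizeof`. The sketch below copies the two layouts quoted above under renamed, hypothetical struct tags (`old_xrecord`, `new_xrecord`) so it does not clash with the real headers. The 32-byte figure assumes a typical LP64 platform (64-bit pointers, 64-bit `long`); on a platform where `long` is 32 bits the old layout would be smaller.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout quoted for v2.51.0 (tag renamed for this sketch). */
struct old_xrecord {
	struct old_xrecord *next;
	char const *ptr;
	long size;
	unsigned long ha;
};

/* Layout quoted for this patch series (tag renamed for this sketch). */
struct new_xrecord {
	uint8_t const *ptr;
	size_t size;
	uint64_t line_hash;
	size_t minimal_perfect_hash;
};

static size_t old_xrecord_size(void)
{
	return sizeof(struct old_xrecord);
}

static size_t new_xrecord_size(void)
{
	return sizeof(struct new_xrecord);
}

/* Element size if minimal_perfect_hash were moved into its own array. */
static size_t split_array_element_size(void)
{
	return sizeof(size_t);
}
```

On LP64 both functions return 32 and the split-out array element is 8 bytes, which is the "8 bytes instead of 32" comparison made in the message.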

gitgitgadget-git · 2025-10-21T08:38:12Z

xdiff/xdiffi.c

 static int get_indent(xrecord_t *rec)
 {
 	long i;
 	int ret = 0;


On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Oct 15, 2025 at 09:18:14PM +0000, Ezekiel Newren via GitGitGadget wrote:
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 6f3998ee54..411a8aa69f 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
>  
>  		rec = &xe->xdf1.recs[xch->i1];
>  		for (i = 0; i < xch->chg1 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		rec = &xe->xdf2.recs[xch->i2];
>  		for (i = 0; i < xch->chg2 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		xch->ignore = ignore;
>  	}

Okay. Seemingly, we convert the structure itself, but we don't convert
any of the functions to accept an `uint8_t`. I guess you drew the line
here so that we don't have to also touch up dozens of function
signatures?

And how did you end up verifying that you added all casts? Does the
compiler flag those as warnings?

In any case, it might be nice to explain both of these details in the
commit message.

Patrick

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:14PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> > index 6f3998ee54..411a8aa69f 100644
> > --- a/xdiff/xdiffi.c
> > +++ b/xdiff/xdiffi.c
> > @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
> >
> >               rec = &xe->xdf1.recs[xch->i1];
> >               for (i = 0; i < xch->chg1 && ignore; i++)
> > -                     ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> > +                     ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
> >
> >               rec = &xe->xdf2.recs[xch->i2];
> >               for (i = 0; i < xch->chg2 && ignore; i++)
> > -                     ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> > +                     ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
> >
> >               xch->ignore = ignore;
> >       }
>
> Okay. Seemingly, we convert the structure itself, but we don't convert
> any of the functions to accept an `uint8_t`. I guess you drew the line
> here so that we don't have to also touch up dozens of function
> signatures?

That is correct. I wanted to avoid _boiling the ocean_ just to change
the type of ptr.

> And how did you end up verifying that you added all casts? Does the
> compiler flag those as warnings?

I used CLion to search for all uses of that field and then added casts
where the types differ. Another way to do that is to run `make
DEVELOPER=1` and address all of the `uint8_t differs in signedness
from char` errors that are spat out.

> In any case, it might be nice to explain both of these details in the
> commit message.

I will update it. Thanks.
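
The pattern under discussion, a struct field of type `uint8_t const *` passed to helpers that still take `const char *`, can be shown in isolation. The names below (`byte_record`, `toy_blankline`, `toy_record_is_blank`) are hypothetical stand-ins, not xdiff functions. Without the explicit cast, GCC and Clang can report a pointer-signedness diagnostic (`-Wpointer-sign`), which is presumably the class of message the `make DEVELOPER=1` build turns into errors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper that still takes `const char *`, like the callees in the diff. */
static int toy_blankline(const char *line, size_t size)
{
	for (size_t i = 0; i < size; i++)
		if (line[i] != ' ' && line[i] != '\t' && line[i] != '\n')
			return 0;
	return 1;
}

/* Hypothetical record whose pointer has already been converted to uint8_t. */
struct byte_record {
	const uint8_t *ptr;
	size_t size;
};

static int toy_record_is_blank(const struct byte_record *rec)
{
	/*
	 * The cast only changes the pointed-to type's signedness; the
	 * bytes themselves are untouched.
	 */
	return toy_blankline((const char *)rec->ptr, rec->size);
}
```

The trade-off raised by the reviewers is visible even here: the struct gets a byte-oriented type, but every call site that crosses into `char`-based code needs a cast until the callee signatures are converted as well.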

gitgitgadget-git · 2025-10-21T08:38:14Z

gitgitgadget-git · 2025-10-21T08:38:36Z

xdiff-interface.c

 		xecfg->find_func_priv = NULL;
 	}
 }



On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>

This should have a commit message explaining what exactly you're doing
here.

Patrick

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> This should have a commit message explaining what exactly you're doing
> here.

I thought I did have a commit message justifying my changes. Maybe it
got deleted through a rebase. How about a message like:

Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long to uint64_t so that the hash output is consistent across
platforms. `flags` was changed from long to uint64_t to ensure the
high order bits are not dropped on platforms that treat long as 32
bits.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Oct 22, 2025 at 03:20:32PM -0600, Ezekiel Newren wrote:
> On Tue, Oct 21, 2025 at 2:33 AM Patrick Steinhardt <ps@pks.im> wrote:
> >
> > On Wed, Oct 15, 2025 at 09:18:16PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > This should have a commit message explaining what exactly you're doing
> > here.
>
> I thought I did have a commit message justifying my changes. Maybe it
> got deleted through a rebase. How about a message like:
>
> Convert the function signature and body to use unambiguous types. char
> is changed to uint8_t because this function processes bytes in memory.
> unsigned long to uint64_t so that the hash output is consistent across
> platforms. `flags` was changed from long to uint64_t to ensure the
> high order bits are not dropped on platforms that treat long as 32
> bits.

Works for me, I guess. Thanks!

Patrick
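
The concern about `flags` in the proposed commit message can be demonstrated without a 32-bit-`long` platform at hand. The sketch below uses `uint32_t` as a stand-in for such a `long` so that the narrowing is well defined and reproducible everywhere; `TOY_HIGH_FLAG` and both function names are hypothetical and do not correspond to any real xdiff flag or function.

```c
#include <stdint.h>

/* Hypothetical flag that lives above bit 31. */
#define TOY_HIGH_FLAG (UINT64_C(1) << 40)

/*
 * Models passing flags through a 32-bit integer, as would happen with
 * `long` on platforms where it is 32 bits wide.
 */
static uint64_t flags_through_32bit(uint64_t flags)
{
	uint32_t narrow = (uint32_t)flags; /* bits 32..63 are discarded here */
	return narrow;
}

/* Models passing flags through a fixed 64-bit type. */
static uint64_t flags_through_64bit(uint64_t flags)
{
	uint64_t wide = flags;
	return wide;
}
```

With a fixed-width `uint64_t` parameter the high flag survives on every platform, whereas the 32-bit path silently drops it.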

gitgitgadget-git · 2025-10-21T08:39:08Z

xdiff/xtypes.h

 } xrecord_t;

 typedef struct s_xdfile {
 	xrecord_t *recs;


On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> rindex describes a index offset which means it's an index into memory
> which should use size_t. dstart and dend will be deleted in a future
> patch series. Move them to the end to help avoid refactor conflicts.

In a patch like this I would appreciate some explanation why we can
change the type without adapting any of its users. So basically explain
why this refactoring is safe to do and won't cause any issues.

Patrick

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Tue, Oct 21, 2025 at 2:34 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > From: Ezekiel Newren <ezekielnewren@gmail.com>
> >
> > rindex describes a index offset which means it's an index into memory
> > which should use size_t. dstart and dend will be deleted in a future
> > patch series. Move them to the end to help avoid refactor conflicts.
>
> In a patch like this I would appreciate some explanation why we can
> change the type without adapting any of its users. So basically explain
> why this refactoring is safe to do and won't cause any issues.

The values of rindex are only used in 3 places. get_hash() which was
created in [1]. and 2 places in xdl_recs_cmp(). All of them use rindex
as an index into another array directly so there's no cascading
refactor impact. get_hash() was created precisely to reduce refactor
churn. How about a commit message like:

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

[1] create get_hash()
https://lore.kernel.org/git/637d1032abbd33b7673d3c101267816fbf1a343c.1758926520.git.gitgitgadget@gmail.com/

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Wed, Oct 22, 2025 at 04:14:42PM -0600, Ezekiel Newren wrote:
> On Tue, Oct 21, 2025 at 2:34 AM Patrick Steinhardt <ps@pks.im> wrote:
> >
> > On Wed, Oct 15, 2025 at 09:18:20PM +0000, Ezekiel Newren via GitGitGadget wrote:
> > > From: Ezekiel Newren <ezekielnewren@gmail.com>
> > >
> > > rindex describes a index offset which means it's an index into memory
> > > which should use size_t. dstart and dend will be deleted in a future
> > > patch series. Move them to the end to help avoid refactor conflicts.
> >
> > In a patch like this I would appreciate some explanation why we can
> > change the type without adapting any of its users. So basically explain
> > why this refactoring is safe to do and won't cause any issues.
>
> The values of rindex are only used in 3 places. get_hash() which was
> created in [1]. and 2 places in xdl_recs_cmp(). All of them use rindex
> as an index into another array directly so there's no cascading
> refactor impact. get_hash() was created precisely to reduce refactor
> churn. How about a commit message like:
>
> Changing the type of rindex from long to size_t has no cascading
> refactor impact because it is only ever used to directly index other
> arrays.

Sounds good to me, thanks!

Patrick
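
The "no cascading refactor" argument rests on the values being consumed immediately as subscripts. A minimal sketch of that access pattern follows; the struct and function names (`idx_rec`, `idx_file`, `idx_get_hash`) are hypothetical, and the real get_hash() signature is not shown in this thread, so this is only an accessor in the same spirit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-line record holding only what this sketch needs. */
struct idx_rec {
	uint64_t line_hash;
};

/* Hypothetical file descriptor with an index array of size_t. */
struct idx_file {
	const struct idx_rec *recs;
	const size_t *rindex; /* each element is an index into recs[] */
	size_t nreff;         /* number of valid entries in rindex[] */
};

/*
 * The rindex value never leaves this expression: it is read and used
 * as a subscript at once, so no caller has to store it in a variable
 * whose type would also need changing.
 */
static uint64_t idx_get_hash(const struct idx_file *f, size_t i)
{
	return f->recs[f->rindex[i]].line_hash;
}
```

Funnelling accesses through one small accessor like this is what keeps a type change to the array element local to a single function.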

gitgitgadget-git · 2025-10-21T10:46:25Z

gitgitgadget-git · 2025-10-21T11:30:14Z

gitgitgadget-git · 2025-10-21T11:42:14Z

xdiff/xtypes.h

 	unsigned long ha;
 } xrecord_t;

 typedef struct s_xdfile {


On the Git mailing list, Phillip Wood wrote (reply to this):

On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> ssize_t is appropriate for dstart and dend because they both describe
> positive or negative offsets relative to a pointer.

Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t
is not guaranteed to be the same width as size_t (this has caused
problems in the past[1]) and is only defined by POSIX, not the C standard.

Thanks

Phillip

[1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/

> A future patch will move these fields to a different struct. Moving
> them to the end of xdfile_t now, means the field order of xdfile_t will
> be disturbed less.
>
> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xtypes.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index f145abba3e..3514bb1684 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -47,10 +47,10 @@ typedef struct s_xrecord {
>  typedef struct s_xdfile {
>  	xrecord_t *recs;
>  	long nrec;
> -	long dstart, dend;
>  	bool *changed;
>  	long *rindex;
>  	long nreff;
> +	ssize_t dstart, dend;
>  } xdfile_t;
>  
>  typedef struct s_xdfenv {

On the Git mailing list, Junio C Hamano wrote (reply to this):

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
>> From: Ezekiel Newren <ezekielnewren@gmail.com>
>>
>> ssize_t is appropriate for dstart and dend because they both describe
>> positive or negative offsets relative to a pointer.
>
> Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t
> is not guaranteed to be the same width as size_t (this has caused
> problems in the past[1]) and is only defined by POSIX, not the C standard.
>
> Thanks
>
> Phillip
>
> [1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/

Thanks for bringing up a very good point.

We often consider that a function that yields what we would normally
put in a size_t variable, when we _know_ that the return value would
not be so big to exceed half the range of size_t, can instead return
ssize_t and use the negative half of the range to signal error
conditions, but as the cited incident shows that it is an easy
mistake to make.
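
For reference, `ptrdiff_t` is the type the C standard assigns to the result of subtracting two pointers and it is declared in `<stddef.h>`, whereas `ssize_t` comes from POSIX. A minimal sketch of a signed pointer offset using only standard C; the helper names (`offset_from`, `apply_offset`) are hypothetical:

```c
#include <stddef.h>

/*
 * Signed distance, in elements, from base to p. Both pointers must
 * point into (or one past the end of) the same array.
 */
static ptrdiff_t offset_from(const long *base, const long *p)
{
	return p - base;
}

/* Apply a possibly negative element offset to a pointer into an array. */
static const long *apply_offset(const long *base, ptrdiff_t off)
{
	return base + off;
}
```

Because the offset can be negative, an unsigned type such as `size_t` would not fit this use, which is why the discussion is about which signed type to pick.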

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Tue, Oct 21, 2025 at 11:18 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Phillip Wood <phillip.wood123@gmail.com> writes:
>
> > On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> >> From: Ezekiel Newren <ezekielnewren@gmail.com>
> >>
> >> ssize_t is appropriate for dstart and dend because they both describe
> >> positive or negative offsets relative to a pointer.
> >
> > Isn't ptrdiff_t the appropriate type for an offset to a pointer? ssize_t
> > is not guaranteed to be the same width as size_t (this has caused
> > problems in the past[1]) and is only defined by POSIX, not the C standard.
> >
> > Thanks
> >
> > Phillip
> >
> > [1] https://lore.kernel.org/git/loom.20150207T174514-727@post.gmane.org/
>
> Thanks for bringing up a very good point.
>
> We often consider that a function that yields what we would normally
> put in a size_t variable, when we _know_ that the return value would
> not be so big to exceed half the range of size_t, can instead return
> ssize_t and use the negative half of the range to signal error
> conditions, but as the cited incident shows that it is an easy
> mistake to make.

In my compat/rust_types.h file (which was dropped) I defined isize
using ptrdiff_t rather than ssize_t. Maybe that file should be revived
so that we don't have confusion in code reviews when structs are being
expressly converted for the purpose of Rust FFI? I'd really like to
bring that file back so that everyone has a clear reference for how C
types map to Rust, but no one seemed to like it except me. Maybe it
should be an adoc file rather than a header?

[1] compat/rust_types.h
https://lore.kernel.org/git/2a7d5b05c18d4a96f1905b7043d47c62d367cd2a.1757274320.git.gitgitgadget@gmail.com/

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ezekiel Newren <ezekielnewren@gmail.com> writes:

> In my compat/rust_types.h file (which was dropped) I defined isize
> using ptrdiff_t rather than ssize_t. Maybe that file should be revived
> so that we don't have confusion in code reviews when structs are being
> expressly converted for the purpose of Rust FFI? I'd really like to
> bring that file back so that everyone has a clear reference for how C
> types map to Rust, but no one seemed to like it except me. Maybe it
> should be an adoc file rather than a header?

I may be mistaken, but I thought that the latest agreement was to
use conceptually the "same" type in each language, have each
language call that type in its native way, and if needed convert at
the FFI boundary. So if we agree to use, for example, 64-bit signed
integer type for counting things plus returning error conditions via
negative values, maybe C-side can agree to use i64 for it, without
having to worry about how that thing is called in Rust side.

I am not sure in what way <compat/rust_types.h> should be used, and
perhaps a documentation file may be sufficient as you suggest, but
in any case, I agree that it should be made clear to everybody what
C-types are to be mapped to what Rust types and vice versa, and if
some C-types have no corresponding Rust type in that mapping, or if
some Rust types have no corresponding C-type, that type needs to be
converted before they reach the FFI boundary.

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Wed, Oct 22, 2025 at 3:38 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ezekiel Newren <ezekielnewren@gmail.com> writes:
>
> > In my compat/rust_types.h file (which was dropped) I defined isize
> > using ptrdiff_t rather than ssize_t. Maybe that file should be revived
> > so that we don't have confusion in code reviews when structs are being
> > expressly converted for the purpose of Rust FFI? I'd really like to
> > bring that file back so that everyone has a clear reference for how C
> > types map to Rust, but no one seemed to like it except me. Maybe it
> > should be an adoc file rather than a header?
>
> I may be mistaken, but I thought that the latest agreement was to
> use conceptually the "same" type in each language, have each
> language call that type in its native way, and if needed convert at
> the FFI boundary. So if we agree to use, for example, 64-bit signed
> integer type for counting things plus returning error conditions via
> negative values, maybe C-side can agree to use i64 for it, without
> having to worry about how that thing is called in Rust side.

Your understanding is correct. Would
Documentation/unambiguous_types.adoc be an appropriate place for this
documentation?

> I am not sure in what way <compat/rust_types.h> should be used, and
> perhaps a documentation file may be sufficient as you suggest, but
> in any case, I agree that it should be made clear to everybody what
> C-types are to be mapped to what Rust types and vice versa, and if
> some C-types have no corresponding Rust type in that mapping, or if
> some Rust types have no corresponding C-type, that type needs to be
> converted before they reach the FFI boundary.

Alright. I guess I'll drop the idea of compat/rust_types.h permanently.

gitgitgadget-git · 2025-10-21T13:43:17Z

xdiff/xdiffi.c

 static int get_indent(xrecord_t *rec)
 {
 	long i;
 	int ret = 0;


On the Git mailing list, Phillip Wood wrote (reply to this):

On 15/10/2025 22:18, Ezekiel Newren via GitGitGadget wrote:
> From: Ezekiel Newren <ezekielnewren@gmail.com>
>
> Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also
> referring to bytes in memory, rather than unicode code points, use
> uint8_t instead of char.

In C, "char" never refers to a unicode code point so I don't follow the
reasoning here. Isn't the reason you want to change from "char" to
"uint8_t" to match rust? Given "char" and "uint8_t" are the same width
why can't we use "char" in the C struct and "u8" in the rust struct as
the two structs would still have the same layout?

I agree with Patrick's comments on this patch - it would be nice to know
how you decided where to add casts. Given that rust is going to be
optional for at least a year we should take care to leave the C code in
good shape with a minimum number of casts.

Thanks

Phillip

> Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
> ---
>  xdiff/xdiffi.c    |  8 ++++----
>  xdiff/xemit.c     |  6 +++---
>  xdiff/xmerge.c    | 14 +++++++-------
>  xdiff/xpatience.c |  2 +-
>  xdiff/xprepare.c  |  8 ++++----
>  xdiff/xtypes.h    |  2 +-
>  xdiff/xutils.c    |  4 ++--
>  7 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 6f3998ee54..411a8aa69f 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -407,7 +407,7 @@ static int get_indent(xrecord_t *rec)
>  	int ret = 0;
>  
>  	for (i = 0; i < rec->size; i++) {
> -		char c = rec->ptr[i];
> +		uint8_t c = rec->ptr[i];
>  
>  		if (!XDL_ISSPACE(c))
>  			return ret;
> @@ -993,11 +993,11 @@ static void xdl_mark_ignorable_lines(xdchange_t *xscr, xdfenv_t *xe, long flags)
>  
>  		rec = &xe->xdf1.recs[xch->i1];
>  		for (i = 0; i < xch->chg1 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		rec = &xe->xdf2.recs[xch->i2];
>  		for (i = 0; i < xch->chg2 && ignore; i++)
> -			ignore = xdl_blankline(rec[i].ptr, rec[i].size, flags);
> +			ignore = xdl_blankline((const char *)rec[i].ptr, rec[i].size, flags);
>  
>  		xch->ignore = ignore;
>  	}
> @@ -1008,7 +1008,7 @@ static int record_matches_regex(xrecord_t *rec, xpparam_t const *xpp) {
>  	size_t i;
>  
>  	for (i = 0; i < xpp->ignore_regex_nr; i++)
> -		if (!regexec_buf(xpp->ignore_regex[i], rec->ptr, rec->size, 1,
> +		if (!regexec_buf(xpp->ignore_regex[i], (const char *)rec->ptr, rec->size, 1,
>  				 &regmatch, 0))
>  			return 1;
>  
> diff --git a/xdiff/xemit.c b/xdiff/xemit.c
> index b2f1f30cd3..ead930088a 100644
> --- a/xdiff/xemit.c
> +++ b/xdiff/xemit.c
> @@ -27,7 +27,7 @@ static int xdl_emit_record(xdfile_t *xdf, long ri, char const *pre, xdemitcb_t *
>  {
>  	xrecord_t *rec = &xdf->recs[ri];
>  
> -	if (xdl_emit_diffrec(rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
> +	if (xdl_emit_diffrec((char const *)rec->ptr, rec->size, pre, strlen(pre), ecb) < 0)
>  		return -1;
>  
>  	return 0;
> @@ -113,8 +113,8 @@ static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
>  	xrecord_t *rec = &xdf->recs[ri];
>  
>  	if (!xecfg->find_func)
> -		return def_ff(rec->ptr, rec->size, buf, sz);
> -	return xecfg->find_func(rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
> +		return def_ff((const char *)rec->ptr, rec->size, buf, sz);
> +	return xecfg->find_func((const char *)rec->ptr, rec->size, buf, sz, xecfg->find_func_priv);
>  }
>  
>  static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
> diff --git a/xdiff/xmerge.c b/xdiff/xmerge.c
> index fd600cbb5d..75cb3e76a2 100644
> --- a/xdiff/xmerge.c
> +++ b/xdiff/xmerge.c
> @@ -101,8 +101,8 @@ static int xdl_merge_cmp_lines(xdfenv_t *xe1, int i1, xdfenv_t *xe2, int i2,
>  	xrecord_t *rec2 = xe2->xdf2.recs + i2;
>  
>  	for (i = 0; i < line_count; i++) {
> -		int result = xdl_recmatch(rec1[i].ptr, rec1[i].size,
> -			rec2[i].ptr, rec2[i].size, flags);
> +		int result = xdl_recmatch((const char *)rec1[i].ptr, rec1[i].size,
> +			(const char *)rec2[i].ptr, rec2[i].size, flags);
>  		if (!result)
>  			return -1;
>  	}
> @@ -324,8 +324,8 @@ static int xdl_fill_merge_buffer(xdfenv_t *xe1, const char *name1,
>  
>  static int recmatch(xrecord_t *rec1, xrecord_t *rec2, unsigned long flags)
>  {
> -	return xdl_recmatch(rec1->ptr, rec1->size,
> -			    rec2->ptr, rec2->size, flags);
> +	return xdl_recmatch((const char *)rec1->ptr, rec1->size,
> +			    (const char *)rec2->ptr, rec2->size, flags);
>  }
>  
>  /*
> @@ -382,10 +382,10 @@ static int xdl_refine_conflicts(xdfenv_t *xe1, xdfenv_t *xe2, xdmerge_t *m,
>  		 * we have a very simple mmfile structure.
>  		 */
>  		t1.ptr = (char *)xe1->xdf2.recs[m->i1].ptr;
> -		t1.size = xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
> +		t1.size = (char *)xe1->xdf2.recs[m->i1 + m->chg1 - 1].ptr
>  			+ xe1->xdf2.recs[m->i1 + m->chg1 - 1].size - t1.ptr;
>  		t2.ptr = (char *)xe2->xdf2.recs[m->i2].ptr;
> -		t2.size = xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
> +		t2.size = (char *)xe2->xdf2.recs[m->i2 + m->chg2 - 1].ptr
>  			+ xe2->xdf2.recs[m->i2 + m->chg2 - 1].size - t2.ptr;
>  		if (xdl_do_diff(&t1, &t2, xpp, &xe) < 0)
>  			return -1;
> @@ -440,7 +440,7 @@ static int line_contains_alnum(const char *ptr, long size)
>  static int lines_contain_alnum(xdfenv_t *xe, int i, int chg)
>  {
>  	for (; chg; chg--, i++)
> -		if (line_contains_alnum(xe->xdf2.recs[i].ptr,
> +		if (line_contains_alnum((const char *)xe->xdf2.recs[i].ptr,
>  				xe->xdf2.recs[i].size))
>  			return 1;
>  	return 0;
> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
> index 669b653580..bb61354f22 100644
> --- a/xdiff/xpatience.c
> +++ b/xdiff/xpatience.c
> @@ -121,7 +121,7 @@ static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
>  		return;
>  	map->entries[index].line1 = line;
>  	map->entries[index].hash = record->ha;
> -	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1].ptr);
> +	map->entries[index].anchor = is_anchor(xpp, (const char *)map->env->xdf1.recs[line - 1].ptr);
>  	if (!map->first)
>  		map->first = map->entries + index;
>  	if (map->last) {
> diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
> index 192334f1b7..4cb18b2b88 100644
> --- a/xdiff/xprepare.c
> +++ b/xdiff/xprepare.c
> @@ -99,8 +99,8 @@ static int xdl_classify_record(unsigned int pass, xdlclassifier_t *cf, xrecord_t
>  	hi = (long) XDL_HASHLONG(rec->ha, cf->hbits);
>  	for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
>  		if (rcrec->rec.ha == rec->ha &&
> -				xdl_recmatch(rcrec->rec.ptr, rcrec->rec.size,
> -					rec->ptr, rec->size, cf->flags))
> +				xdl_recmatch((const char *)rcrec->rec.ptr, rcrec->rec.size,
> +					(const char *)rec->ptr, rec->size, cf->flags))
>  			break;
>  
>  	if (!rcrec) {
> @@ -156,8 +156,8 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
>  			if (XDL_ALLOC_GROW(xdf->recs, xdf->nrec + 1, narec))
>  				goto abort;
>  			crec = &xdf->recs[xdf->nrec++];
> -			crec->ptr = prev;
> -			crec->size = (long) (cur - prev);
> +			crec->ptr = (uint8_t const *)prev;
> +			crec->size =(long) ( cur - prev);
>  			crec->ha = hav;
>  			if (xdl_classify_record(pass, cf, crec) < 0)
>  				goto abort;
> diff --git a/xdiff/xtypes.h b/xdiff/xtypes.h
> index 3514bb1684..57983627f5 100644
> --- a/xdiff/xtypes.h
> +++ b/xdiff/xtypes.h
> @@ -39,7 +39,7 @@ typedef struct s_chastore {
>  } chastore_t;
>  
>  typedef struct s_xrecord {
> -	char const *ptr;
> +	uint8_t const *ptr;
>  	long size;
>  	unsigned long ha;
>  } xrecord_t;
> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
> index 447e66c719..7be063bfb6 100644
> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -465,10 +465,10 @@ int xdl_fall_back_diff(xdfenv_t *diff_env, xpparam_t const *xpp,
>  	xdfenv_t env;
>  
>  	subfile1.ptr = (char *)diff_env->xdf1.recs[line1 - 1].ptr;
> -	subfile1.size = diff_env->xdf1.recs[line1 + count1 - 2].ptr +
> +	subfile1.size = (char *)diff_env->xdf1.recs[line1 + count1 - 2].ptr +
>  		diff_env->xdf1.recs[line1 + count1 - 2].size - subfile1.ptr;
>  	subfile2.ptr = (char *)diff_env->xdf2.recs[line2 - 1].ptr;
> -	subfile2.size = diff_env->xdf2.recs[line2 + count2 - 2].ptr +
> +	subfile2.size = (char *)diff_env->xdf2.recs[line2 + count2 - 2].ptr +
>  		diff_env->xdf2.recs[line2 + count2 - 2].size - subfile2.ptr;
>  	if (xdl_do_diff(&subfile1, &subfile2, xpp, &env) < 0)
>  		return -1;

On the Git mailing list, Junio C Hamano wrote (reply to this):

Phillip Wood <phillip.wood123@gmail.com> writes:

> In C "char" never refers to a Unicode code point so I don't follow the
> reasoning here. Isn't the reason you want to change from "char" to
> "uint8_t" to match Rust? Given "char" and "uint8_t" are the same width
> why can't we use "char" in the C struct and "u8" in the Rust struct as
> the two structs would still have the same layout?

And forcing u8 makes sure both sides of the FFI agree on the signedness (C "char"'s signedness is implementation defined), which is a good thing.

I 100% agree that being honest about the motivation to sell this change would be a good thing to do here. I do not think "in this series, I want to match the types used at the interface to be of Rust's" is a position to be ashamed of ;-)

> I agree with Patrick's comments on this patch - it would be nice to know
> how you decided where to add casts. Given that Rust is going to be
> optional for at least a year we should take care to leave the C code in
> good shape with a minimum number of casts.

Thanks.

On the Git mailing list, Phillip Wood wrote (reply to this):

On 21/10/2025 19:15, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
>
>> In C "char" never refers to a Unicode code point so I don't follow the
>> reasoning here. Isn't the reason you want to change from "char" to
>> "uint8_t" to match Rust? Given "char" and "uint8_t" are the same width
>> why can't we use "char" in the C struct and "u8" in the Rust struct as
>> the two structs would still have the same layout?
>
> And forcing u8 makes sure both sides of the FFI agree on the
> signedness (C "char"'s signedness is implementation defined),
> which is a good thing.

That's true and ignoring the signedness would be hacky but I'm not sure it matters in practice. Both C and Rust would use the same bit patterns for "abc" and b"abc\0" and in general C plays fast and loose with the signedness of variables all over the place. The trade-off for respecting the signedness is that we either have casts all over the place or massive churn converting the rest of the code to use uint8_t. This problem isn't limited to xdiff; it will be true wherever we share byte strings, such as the contents of objects, between C and Rust, as we tend to use char rather than uint8_t in our code.

Thanks

Phillip

> I 100% agree that being honest about the motivation to sell this
> change would be a good thing to do here. I do not think "in this
> series, I want to match the types used at the interface to be of
> Rust's" is a position to be ashamed of ;-)
>
>> I agree with Patrick's comments on this patch - it would be nice to know
>> how you decided where to add casts. Given that Rust is going to be
>> optional for at least a year we should take care to leave the C code in
>> good shape with a minimum number of casts.
>
> Thanks.

On the Git mailing list, Ezekiel Newren wrote (reply to this):

On Wed, Oct 22, 2025 at 7:27 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> > I 100% agree that being honest about the motivation to sell this
> > change would be a good thing to do here. I do not think "in this
> > series, I want to match the types used at the interface to be of
> > Rust's" is a position to be ashamed of ;-)
> >
> >> I agree with Patrick's comments on this patch - it would be nice to know
> >> how you decided where to add casts. Given that rust is going to be
> >> optional for at least a year we should take care to leave the C code in
> >> good shape with a minimum number of casts.
> >
> > Thanks.

I'm not arguing that uint8_t should be used everywhere in Git, only that it should be used everywhere in xdiff. xrecord_t and xdfile_t are fundamental to how xdiff passes data around and they need to be transparent to both sides. I'm trying to leave the rest of the data structures alone in order to avoid refactor churn. Refactoring C to use unambiguous types, outside of xdiff, is outside the scope of this patch series.

Another problem with using char instead of uint8_t is that tools like cbindgen and bindgen don't translate char to u8. bindgen will see char and will produce std::ffi::c_char on the Rust side; see [1] for why that's a problem. The other way around is a problem too. When cbindgen sees u8 it will generate uint8_t on the C side and then `make DEVELOPER=1` won't compile because uint8_t and char differ in signedness.

[1] Problems with C types https://lore.kernel.org/git/CAH=ZcbA_8JM1hdUAfFe3ho0ShuniguEpV1308S0nCkCHOCsmmg@mail.gmail.com/

gitgitgadget-git · 2025-10-21T13:43:31Z

gitgitgadget-git · 2025-10-21T13:43:46Z

gitgitgadget-git · 2025-10-21T23:15:06Z

gitgitgadget-git · 2025-10-22T02:45:22Z

gitgitgadget-git · 2025-10-22T22:11:12Z

ezekielnewren added 9 commits October 15, 2025 15:15

xdiff: make xrecord_t.ptr a uint8_t instead of char

7b9e896

Rust uses u8 to refer to bytes in memory. Since xrecord_t.ptr is also referring to bytes in memory, rather than unicode code points, use uint8_t instead of char. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

xdiff: use size_t for xrecord_t.size

ae15ed7

size_t is the appropriate type because size is describing the number of elements, bytes in this case, in memory. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

xdiff: use unambiguous types in xdl_hash_record()

7fcd83c

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

xdiff: make xdfile_t.nrec a size_t instead of long

5767ba4

size_t is used because nrec describes the number of elements in memory for recs, and the number of elements in memory for 'changed' + 2. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

xdiff: make xdfile_t.nreff a size_t instead of long

4caa6a4

size_t is used because nreff describes the number of elements in memory for rindex. Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>

ezekielnewren force-pushed the xdiff_cleanup_part2 branch from 77cca9e to 518e5f5 on October 15, 2025 21:15

gitgitgadget-git bot added the seen label Oct 15, 2025

gitgitgadget-git bot reviewed Oct 16, 2025

gitgitgadget-git bot reviewed Oct 20, 2025

gitgitgadget-git bot reviewed Oct 21, 2025

Labels