feat(ptx): Implement -S (sentence-regexp) mode with dual-mode architecture and comprehensive test coverage by Misakait · Pull Request #8915 · uutils/coreutils

Conversation

Misakait (Contributor) commented Oct 15, 2025

This PR significantly enhances the ptx utility by implementing the -S (sentence-regexp) option and introducing a robust dual-mode architecture that can handle both traditional line-based processing and context-aware stream processing.

Architecture Overview

  • Dual-mode processing:
    • Line Mode: Traditional processing (when -G is used or context regex is \n)
    • Stream Mode: Context-aware processing using -S sentence regex for intelligent chunking
  • Flexible word indexing: Uses WordRef enum with LineWordRef and StreamWordRef variants optimized for each mode
  • GNU ptx compatibility: Implements key option interactions (-S) and improves GNU layout compatibility
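
The mode choice described above can be sketched roughly as follows. This is only an illustration: `Mode`, `select_mode`, and the `traditional` parameter are hypothetical names, not taken from the PR, and the sketch assumes the effective context regex is always available as a string.

```rust
/// Which processing path a run would take (illustrative names, not the PR's).
#[derive(Debug, PartialEq)]
enum Mode {
    Line,
    Stream,
}

/// `traditional` stands for -G; `context_regex` is the effective -S pattern.
/// Assumption: a newline pattern may arrive either as a real newline
/// character or as the two-character escape `\n`.
fn select_mode(traditional: bool, context_regex: &str) -> Mode {
    if traditional || matches!(context_regex, "\n" | "\\n") {
        Mode::Line
    } else {
        Mode::Stream
    }
}
```

With this shape, `select_mode(false, "[.?!]")` picks the stream path, while `-G` or a newline context regex falls back to line mode.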

Key Features Implemented

  1. -S, --sentence-regexp=REGEXP Support:
    • Custom context boundaries for keyword indexing
    • Stream mode treats the entire file as a single stream, splitting it by the sentence regex
    • Proper handling of context contraction when used with -r (references)
  2. Enhanced Option Interactions:
    • Ensures correct keyword filtering and context formatting when -r (references) is used with the new stream mode (-S).
    • Ensures the -W (word-regexp) option is respected by all internal logic, including performance optimizations.
    • Maintains correct -A (auto-reference) line number generation in both line and stream modes.
  3. Advanced Output Formatting:
    • High-Fidelity Layout Replication: Replaced the previous layout logic with a precise, GNU-compatible implementation for head, tail, before, and after fields.
    • Replicates subtle GNU ptx layout quirks (e.g., the "-1" effective length for empty fields and the conservative, non-greedy wrapping algorithm) to achieve 100% output compatibility.
    • Optimized context handling: When the left context is too long, skips ahead to stay compatible with GNU output
    • Proper truncation: Adds truncation markers when context is cut off
  4. Robust Testing:
    • Comprehensive test suite: Extensive test cases covering -S mode scenarios
    • GNU compatibility fixtures: Expected outputs matching GNU ptx behavior
    • Multi-file support: Tests with multiple input files in stream mode
    • Complex flag interactions: Tests for -S + -r, -S + -A, and other combinations
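
To make the stream-mode idea concrete, here is a minimal sketch of cutting a whole input into sentence-like contexts that may span several lines. It is not the PR's code: `split_contexts` is a hypothetical helper, and a fixed rule (end of context after `.`, `?` or `!` followed by whitespace or end of input) stands in for the user-supplied -S regex.

```rust
/// Split `text` into sentence-like contexts.
/// Stand-in rule (an assumption, not the real -S regex): a context ends
/// after '.', '?' or '!' when followed by whitespace or end of input.
fn split_contexts(text: &str) -> Vec<&str> {
    let mut contexts = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((idx, ch)) = chars.next() {
        if matches!(ch, '.' | '?' | '!') {
            let at_boundary = match chars.peek() {
                None => true,
                Some(&(_, next)) => next.is_whitespace(),
            };
            if at_boundary {
                let end = idx + ch.len_utf8();
                let piece = text[start..end].trim();
                if !piece.is_empty() {
                    contexts.push(piece);
                }
                start = end;
            }
        }
    }
    // Anything left after the last terminator is a final context.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        contexts.push(rest);
    }
    contexts
}
```

The point of the sketch is that a context such as "Still the\nsecond sentence!" keeps its embedded newline, which a line-by-line reader would have split in two.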

Motivation

The -S option is one of the most important features of GNU ptx, enabling intelligent
context-aware indexing beyond simple line-by-line processing. This implementation
provides:

  • Full -S mode compatibility with GNU ptx
  • Foundation for future GNU extensions support through the dual-mode architecture
  • Robust handling of complex text processing scenarios
  • Comprehensive test coverage ensuring reliability

Testing

All new functionality is covered by extensive tests with expected outputs matching GNU ptx exactly. The implementation has been verified against:

  • Stream mode with custom sentence regex (-S)
  • Traditional mode (-G) compatibility maintained
  • Word filtering with -W option
  • Reference handling with -r and -A flags
  • Multi-file processing in both modes
  • Complex flag combinations and edge cases

Breaking Changes

None. This enhances existing functionality while maintaining backward compatibility.

Related Issues

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@sylvestre
Contributor

@Misakait please keep in mind that humans are reviewing PRs, not AI...

  • Comment 0 isn't useful. Please provide only relevant information.
  • The ptx change is way too big. It needs to be split into different PRs to be reviewed.
  • Test data and references should be generated on the fly.

Contributor
@sylvestre left a comment

.

@Misakait
Contributor Author

Hi @sylvestre,

Thank you very much for your review and valuable feedback. I sincerely apologize for the size of this PR. I understand that large PRs place a significant burden on maintainers, and I struggled for a long time trying to split it into smaller pieces, but found it nearly impossible. The reason is that these changes are, by their nature, highly architecturally coupled.

In response to your first point, I'd like to explain my process and the dependency chain, and I would be extremely grateful for your advice on how to split this, or for any alternative refactoring path I should follow. As a newcomer, this is an area I find very challenging.

My primary goal was to implement the -S (sentence-regexp) mode.

To achieve this, I discovered that a series of tightly linked changes were necessary:

  1. Enabling Stream Processing: The first step was to introduce stream-based file reading. I modified the get_config function and the Config struct to accept the context_regex parameter.

  2. Refactoring the Core Pipeline: To support -S, the entire processing pipeline (read_input -> create_word_set -> write_traditional_output) had to be adapted.

    • In read_input, I added the logic to read the entire file content into a new full_content field within the FileContent struct, preparing it for create_word_set.

    • In create_word_set, I introduced the core dual-mode logic: a stream-mode path for when context_regex is set, and a line-mode path for the traditional behavior. This fundamental split required introducing the PtxResult struct and the WordRef enum with its LineWordRef and StreamWordRef variants.

  3. Decoupling Formatting Logic: Finally, write_traditional_output had to be refactored. The calculation of layout chunks (tail, before, etc.) was moved up from the format-specific functions (format_tex_line, etc.) into a new set of prepare_*_chunks functions. The dispatch to either prepare_line_chunks or the new prepare_stream_chunks is now determined by the WordRef variant.
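
The dispatch in step 3 could look roughly like the sketch below. The names WordRef, LineWordRef, StreamWordRef, FileContent, full_content, prepare_line_chunks and prepare_stream_chunks come from the description above; every field, the `Chunks` struct, the `prepare_chunks` wrapper and all signatures are assumptions made for illustration and will differ from the actual code.

```rust
// Hypothetical shapes: only the type and function names are from the PR text.
struct LineWordRef { line_nr: usize, start: usize, end: usize }
struct StreamWordRef { context_start: usize, context_end: usize, start: usize, end: usize }

enum WordRef {
    Line(LineWordRef),
    Stream(StreamWordRef),
}

struct FileContent {
    lines: Vec<String>,   // used by line mode
    full_content: String, // whole file, used by stream mode
}

/// Text before, at, and after the keyword (assumed layout input).
struct Chunks<'a> { before: &'a str, keyword: &'a str, after: &'a str }

fn prepare_line_chunks<'a>(file: &'a FileContent, w: &LineWordRef) -> Chunks<'a> {
    let line = &file.lines[w.line_nr];
    Chunks { before: &line[..w.start], keyword: &line[w.start..w.end], after: &line[w.end..] }
}

fn prepare_stream_chunks<'a>(file: &'a FileContent, w: &StreamWordRef) -> Chunks<'a> {
    let text = &file.full_content;
    Chunks {
        before: &text[w.context_start..w.start],
        keyword: &text[w.start..w.end],
        after: &text[w.end..w.context_end],
    }
}

/// The WordRef variant alone decides which preparation path runs.
fn prepare_chunks<'a>(file: &'a FileContent, word: &WordRef) -> Chunks<'a> {
    match word {
        WordRef::Line(w) => prepare_line_chunks(file, w),
        WordRef::Stream(w) => prepare_stream_chunks(file, w),
    }
}
```

In this sketch the formatters only ever see `Chunks`, which is one way to read the claim that layout calculation moved out of the format-specific functions.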

This is essentially what my first commit does. My difficulty in splitting it is that these changes are an interdependent chain. A PR that only modifies read_input, for example, would introduce an unused field and leave the architecture in a broken state. All these changes must be applied together to be meaningful.

My second commit is dedicated to fixing the numerous compatibility bugs that the new architecture revealed. I could not bring myself to submit a PR full of bugs. This second commit primarily refactors the get_output_chunks function to precisely replicate GNU's behavior by implementing truncation detection, the left_field_start jump optimization, and compatibility quirks like the '-1' effective length for empty fields. It was only after this commit that the output finally matched the GNU version.
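
As a rough illustration of what truncation detection means here, the sketch below fits a context into a fixed width and appends a marker only when something was dropped. It is deliberately simplified and is not GNU's wrapping algorithm or the PR's get_output_chunks: `fit_with_marker` is a hypothetical helper, widths are counted in characters, and the marker string is passed in by the caller.

```rust
/// Fit `text` into `width` columns. If it does not fit, keep whole words
/// that fit alongside the marker and append `marker` to show the cut.
/// Simplified stand-in: widths are counted in `char`s, not display columns.
fn fit_with_marker(text: &str, width: usize, marker: &str) -> String {
    if text.chars().count() <= width {
        return text.to_string(); // nothing cut off, no marker
    }
    let budget = width.saturating_sub(marker.chars().count());
    let mut out = String::new();
    for word in text.split_whitespace() {
        let sep = if out.is_empty() { 0 } else { 1 };
        if out.chars().count() + sep + word.chars().count() > budget {
            break;
        }
        if sep == 1 {
            out.push(' ');
        }
        out.push_str(word);
    }
    out.push_str(marker);
    out
}
```

For example, `fit_with_marker("the quick brown fox", 12, "/")` keeps "the quick" and appends the marker, while a string that already fits is returned unchanged.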

Regarding the on-the-fly test generation, you are absolutely right. I will work on modifying that.

Thank you for taking the time to read this long explanation. I have many questions and would be very grateful for any advice you can offer to help me move forward. Thank you!

Development

Successfully merging this pull request may close these issues.

ptx: Implement context_regex (-S) and its default values