Fri Jun 10 2022
How does Linux file indexing and search work?
An indexed file is a computer file with an index that allows easy random access to any record given its file key. The key must uniquely identify a record. If more than one index is present, the others are called alternate indexes. The indexes are created with the file and maintained by the system.
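To make the idea of an index concrete, here is a minimal sketch in Python. It assumes a hypothetical record format of one "key,value" line per record and builds an in-memory index that maps each key to the byte offset of its record. The function names are illustrative only and do not belong to any particular file system.

```python
import io

def build_index(f):
    """Scan a file of 'key,value' lines and map each key to its byte offset."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b",", 1)[0]
        index[key] = offset
    return index

def lookup(f, index, key):
    """Jump straight to the record for `key` instead of scanning the file."""
    f.seek(index[key])
    return f.readline().rstrip(b"\n").split(b",", 1)[1]

# An in-memory stand-in for a data file on disk (hypothetical records).
data = io.BytesIO(b"1001,alice\n1002,bob\n1003,carol\n")
index = build_index(data)
print(lookup(data, index, b"1002"))  # b'bob'
```

The index is built once; after that, each lookup costs one seek and one read, however large the file is.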
The file systems that come with current operating systems help the user find specific files only if the user knows, or can figure out, the precise path names of the needed files. That is, the file system provides exactly one method of identifying files: a hierarchical path name from a unique root. This basic capability is usually accompanied by a number of minor tools. For example, in UNIX the ls command lists the names of the files in a given directory, and for the dedicated hacker who is not intimidated by arcane syntax, the find command can explore the file hierarchy searching for things. If the user can get as close as the directory that contains the file, they may also be able to use programs such as grep as crude content-search tools. Other systems provide similar tools, but none are integrated with the file system itself. As a result, people have developed an alarming number of special-purpose tools to help find the right file in various specific contexts.
For searching the contents of text files, one effective approach is to load the files into a MySQL database and use its full-text search (FULLTEXT indexes queried with MATCH ... AGAINST). This gives fast searches, with results ranked by how well they match the query.
Interfacing a MySQL database with other systems, such as a website for document searching, is then a fairly simple task.
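The idea behind ranked full-text matching can be sketched without a database server. The Python example below is not MySQL's implementation. It is a toy inverted index that scores documents by how often the query words occur in them, and the document names and contents are made up for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each word to a counter of {document name: occurrences}."""
    index = defaultdict(Counter)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word][name] += 1
    return index

def search(index, query):
    """Return document names, best match first (ties broken by name)."""
    scores = Counter()
    for word in tokenize(query):
        for name, count in index.get(word, {}).items():
            scores[name] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _score in ranked]

# Hypothetical documents standing in for text files loaded from disk.
docs = {
    "notes.txt": "linux file search with find and locate",
    "todo.txt": "search search search the index",
    "recipe.txt": "flour sugar eggs",
}
index = build_inverted_index(docs)
print(search(index, "search index"))  # ['todo.txt', 'notes.txt']
```

A real full-text engine adds stop words, better relevance weighting, and on-disk storage, but the lookup-by-word structure is the same.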
Here are a few simple commands that you can use to locate files.
The which command is the simplest of the commands we’re going to explore, but this simplicity comes at a cost: its usage is extremely narrow and specific. Within that narrow scope, though, it does its job very well.
On Linux, most commands that you run on the command line point to an executable file (a compiled binary or a script) somewhere on the system; the exceptions are shell built-ins, aliases, and functions. When you type such a command, the shell looks through the directories listed in the PATH environment variable and executes the first matching file. When you pass a command name to which, the output is the path to that executable file.
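The lookup that which performs can be approximated in a few lines of Python. This is a simplified sketch, not the actual source of which. The helper name and its path parameter are illustrative, and Python's standard library already ships a fuller version as shutil.which.

```python
import os

def which(command, path=None):
    """Return the first executable named `command` in the PATH directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# The result depends on the system, for example /usr/bin/ls or /bin/ls.
print(which("ls"))
```

Because the search stops at the first match, the order of the directories in PATH decides which copy of a command runs.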
An alternative is the whereis command, which gives you a bit more information: not just the location of the command’s binary file, but also the locations of its source files and man pages, where they exist.
The find command is the most versatile of these commands, but also the hardest to learn because of how flexible it is. It searches the current directory and everything below it unless you give it a different starting path. To find a file by its name, use the -name parameter (or the -iname parameter for a case-insensitive match). You can also invert the search and exclude files by name using the -not modifier. The * symbol acts as a wildcard; put the pattern in quotes so that the shell does not expand it before find sees it.
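The name tests described above can be imitated in Python with os.walk and fnmatch. This is a rough sketch of the behaviour, not how find is implemented. The function name and its keyword arguments are made up for illustration, and unlike find it does not test the starting directory itself.

```python
import fnmatch
import os

def find_by_name(root, pattern, ignore_case=False, negate=False):
    """Yield paths under `root` whose base name matches `pattern`.

    ignore_case plays the role of -iname, negate the role of -not.
    """
    if ignore_case:
        pattern = pattern.lower()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidate = name.lower() if ignore_case else name
            matched = fnmatch.fnmatchcase(candidate, pattern)
            if matched != negate:
                yield os.path.join(dirpath, name)
```

For example, list(find_by_name(".", "*.txt", ignore_case=True)) collects every entry below the current directory whose name ends in .txt, .TXT, and so on.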
You can also find files by type using the -type parameter. The following common options correspond to their respective file types:
f: regular files
d: directories
l: symbolic links
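A type test comes down to inspecting the mode bits of each entry. The sketch below shows one way to do that in Python for the f and l letters, plus d for directories. The function and table names are hypothetical, and lstat is used so that a symbolic link is reported as a link rather than as its target.

```python
import os
import stat

# Map find-style type letters to the matching mode checks.
TYPE_CHECKS = {
    "f": stat.S_ISREG,  # regular files
    "d": stat.S_ISDIR,  # directories
    "l": stat.S_ISLNK,  # symbolic links
}

def find_by_type(root, type_letter):
    """Yield paths under `root` whose file type matches `type_letter`."""
    check = TYPE_CHECKS[type_letter]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat does not follow symbolic links.
            if check(os.lstat(path).st_mode):
                yield path
```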
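Similarly, you can search by file size using the -size parameter followed by a string that gives the size, the unit, and whether you want an exact, smaller-than, or greater-than match. A leading + means greater than, a leading - means less than, and no prefix means an exact match. For example, -size +10M matches files larger than 10 MiB. Note that GNU find rounds file sizes up to whole units before comparing.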
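How such a size string might be interpreted is sketched below in Python. It assumes the GNU find conventions (c for bytes, k, M and G for binary multiples, sizes rounded up to whole units) and, as a simplification, requires an explicit unit suffix. The function names are illustrative, and the walker only looks at non-directory entries.

```python
import os

# Unit suffixes and their size in bytes (GNU find conventions assumed).
UNITS = {"c": 1, "k": 1024, "M": 1024**2, "G": 1024**3}

def size_matches(size_in_bytes, spec):
    """Check a byte count against a size string such as '+10M', '-4k' or '100c'."""
    comparison = "="
    if spec[0] in "+-":
        comparison, spec = spec[0], spec[1:]
    number, unit = int(spec[:-1]), UNITS[spec[-1]]
    # Round the size up to whole units before comparing.
    units_used = (size_in_bytes + unit - 1) // unit
    if comparison == "+":
        return units_used > number
    if comparison == "-":
        return units_used < number
    return units_used == number

def find_by_size(root, spec):
    """Yield non-directory entries under `root` whose size matches `spec`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if size_matches(os.lstat(path).st_size, spec):
                yield path
```

The rounding explains a common surprise: a 1-byte file counts as one whole megabyte when the unit is M, so a "less than 1M" test matches only empty files.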
The locate command uses a pre-built database of files and directories to speed up the search. This kind of indexed search is much faster than walking the entire disk, but the downside is that the index can fall out of sync with the file system: files created after the last update are not found, and deleted files may still be listed. Although most systems update the index periodically on their own, you can force an update by running the updatedb command (this usually requires root privileges). To use locate, all you have to do is provide a query string. The command outputs a list of all indexed files and directories whose paths match the query.
If you want to match only against the base name of each entry rather than its whole path, use the -b parameter. If you want to make the search query case-insensitive, use the -i parameter. If you want to limit the number of results, use the -n <#> parameter.
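The split between building the database and querying it can be sketched as follows. This is a simplified model in Python, not the real locate database format. The function names and keyword arguments are hypothetical stand-ins for updatedb and for the -b, -i and -n options.

```python
import os

def build_database(root):
    """Walk `root` once and record every path (the role updatedb plays)."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def locate(database, query, basename_only=False, ignore_case=False, limit=None):
    """Search the pre-built list of paths instead of touching the disk."""
    if ignore_case:
        query = query.lower()
    results = []
    for path in database:
        target = os.path.basename(path) if basename_only else path
        if ignore_case:
            target = target.lower()
        if query in target:
            results.append(path)
            if limit is not None and len(results) >= limit:
                break
    return results

# A hand-written database standing in for the output of build_database().
database = [
    "/home/u/docs/Notes.md",
    "/home/u/docs/report.txt",
    "/home/u/report/data.csv",
]
print(locate(database, "report", basename_only=True))  # ['/home/u/docs/report.txt']
```

Queries never look at the disk, which is why they are fast and also why results can be stale until the database is rebuilt.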
Filesystem indexing and querying features offered by libferris:
Plugins can extract text that is trapped in files for full-text indexing.
Unified interface for all data sources used as input for indexing. For example, the following are all indexable with libferris: text inside SleepyCat dbxml files, inside tarballs, in individual messages in mbox files, or in relational databases.
Metadata trapped inside files can be indexed and searched for.
Identical basic add/query commands for all indexing plugins, so you can switch between indexing implementations fairly easily.
Combination searches for full-text and extended attributes for your filesystem. The search tool allows you to combine many searches into one result set.
Ability to search for files based on the metadata they once had.
Ability to search for files based on Supervised Machine Learning (SML) judgments - spam filtering for your filesystem.