kernel-hacking-2024-linux-steffo

unimore/kernel-hacking-2024-linux-steffo

Author	SHA1	Message	Date
Herbert Xu	f10db6277d	Avoid divide in IS_ALIGN I was happy to discover the brand new IS_ALIGN macro and quickly used it in my code. To my dismay I found that the generated code used division to perform the test. This patch fixes it by changing the % test to an &. This avoids the division. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:04 -08:00
Eric Dumazet	b324215190	PERCPU : __percpu_alloc_mask() can dynamically size percpu_data storage Instead of allocating a fix sized array of NR_CPUS pointers for percpu_data, we can use nr_cpu_ids, which is generally < NR_CPUS. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:04 -08:00
Joern Engel	e7ca2d41a0	Document I_SYNC and I_DATASYNC After some archeology (see http://logfs.org/logfs/inode_state_bits) I finally figured out what the three I_DIRTY bits do. Maybe others would prefer less effort to reach this insight. Signed-off-by: Joern Engel <joern@logfs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:03 -08:00
Jiri Olsa	0321155926	fs: remove dead config CONFIG_HAS_COMPAT_EPOLL_EVENT symbol Remove dead config CONFIG_HAS_COMPAT_EPOLL_EVENT symbol. Signed-off-by: Jiri Olsa <olsajiri@gmail.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:03 -08:00
Robert P. J. Day	de9330d13e	log2.h: Define order_base_2() macro for convenience. Given a number of places in the tree that need to calculate this value explicitly, might as well just create a macro for it. (akpm: must be implemented as a macro for callee typeof() usage) Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:03 -08:00
Adrian Bunk	26464378c4	proper prototype for vty_init() Add a proper prototype for vty_init() in include/linux/vt_kern.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:03 -08:00
Adrian Bunk	011e3fcd1e	proper prototype for get_filesystem_list() Ad a proper prototype for migration_init() in include/linux/fs.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:02 -08:00
Adrian Bunk	a1c9eea9e5	proper prototype for signals_init() Add a proper prototype for signals_init() in include/linux/signal.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:02 -08:00
Andrew Morton	941e492bdb	read_current_timer() cleanups - All implementations can be __devinit - The function prototypes were in asm/timex.h but they all must be the same, so create a single declaration in linux/timex.h. - uninline the sparc64 version to match the other architectures - Don't bother #defining ARCH_HAS_READ_CURRENT_TIMER to a particular value. [ezk@cs.sunysb.edu: fix build] Cc: "David S. Miller" <davem@davemloft.net> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:02 -08:00
Adrian Bunk	83bad1d764	scheduled OSS driver removal This patch contains the scheduled removal of OSS drivers whose config options have been removed in 2.6.23. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:02 -08:00
Adrian Bunk	f74596d079	proper show_interrupts() prototype Add a proper prototype for show_interrupts() in include/linux/interrupt.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:02 -08:00
David Woodhouse	96c5865559	Allow auto-destruction of loop devices This allows a flag to be set on loop devices so that when they are closed for the last time, they'll self-destruct. In general, so that we can automatically allocate loop devices (as with losetup -f) and have them disappear when we're done with them. In particular, right now, so that we can stop relying on the hackish special-case in umount(8) which kills off loop devices which were set up by 'mount -oloop'. That means we can stop putting crap in /etc/mtab which doesn't belong there, which means it can be a symlink to /proc/mounts, which means yet another writable file on the root filesystem is eliminated and the 'stateless' folks get happier... and OLPC trac #356 can be closed. The mount(8) side of that is at http://marc.info/?l=util-linux-ng&m=119362955431694&w=2 [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: David Woodhouse <dwmw2@infradead.org> Cc: Bernardo Innocenti <bernie@codewiz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:01 -08:00
Matthias Kaehlcke	0a5dcb5177	Parallel port: convert port_mutex to the mutex API Parallel port: Convert port_mutex to the mutex API [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:01 -08:00
Matthew Wilcox	4e701482d1	hash: add explicit u32 and u64 versions of hash The 32-bit version is more efficient (and apparently gives better hash results than the 64-bit version), so users who are only hashing a 32-bit quantity can now opt to use the 32-bit version explicitly, rather than promoting to a long. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:00 -08:00
Bartlomiej Zolnierkiewicz	64a57fe439	ide: add ide_read_error() inline helper Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-02-06 02:57:51 +01:00
Bartlomiej Zolnierkiewicz	c47137a99c	ide: add ide_read_[alt]status() inline helpers Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-02-06 02:57:51 +01:00
Bartlomiej Zolnierkiewicz	29dd59755a	ide: remove ide_setup_ports() ide-cris.c: * Add cris_setup_ports() helper and use it instead of ide_setup_ports() (fixes random value being set in ->io_ports[IDE_IRQ_OFFSET]). buddha.c: * Add buddha_setup_ports() helper and use it instead of ide_setup_ports(). falconide.c: * Add falconide_setup_ports() helper and use it instead of ide_setup_ports(), also fix return value of falconide_init() while at it. gayle.c: * Add gayle_setup_ports() helper and use it instead of ide_setup_ports(). macide.c: * Add macide_setup_ports() helper and use it instead of ide_setup_ports() (fixes incorrect value being set in ->io_ports[IDE_IRQ_OFFSET]). q40ide.c: * Fix q40_ide_setup_ports() comments. ide.c: * Remove no longer needed ide_setup_ports(). Cc: Mikael Starvik <starvik@axis.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-02-06 02:57:50 +01:00
Bartlomiej Zolnierkiewicz	afdd360c95	ide: remove write-only ->sata_misc[] from ide_hwif_t * Remove write-only ->sata_misc[] from ide_hwif_t. * Remove no longer used SATA_{MISC,PHY,IEN}_OFFSET defines. Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-02-06 02:57:50 +01:00
Anton Salnikov	7c7e92a926	Palmchip BK3710 IDE driver This is Palmchip BK3710 IDE controller support. The IDE controller logic supports PIO, MultiWord-DMA and Ultra-DMA modes. Supports interface to Compact Flash (CF) configured in True-IDE mode. Bart: - remove dead code - fix ide_hwif_setup_dma() build problem Signed-off-by: Anton Salnikov <asalnikov@ru.mvista.com> Reviewed-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Reviewed-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2008-02-06 02:57:48 +01:00
Linus Torvalds	3d412f60b7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits) [PKT_SCHED]: vlan tag match [NET]: Add if_addrlabel.h to sanitized headers. [NET] rtnetlink.c: remove no longer used functions [ICMP]: Restore pskb_pull calls in receive function [INET]: Fix accidentally broken inet(6)_hash_connect's port offset calculations. [NET]: Remove further references to net-modules.txt bluetooth rfcomm tty: destroy before tty_close() bluetooth: blacklist another Broadcom BCM2035 device drivers/bluetooth/btsdio.c: fix double-free drivers/bluetooth/bpa10x.c: fix memleak bluetooth: uninlining bluetooth: hidp_process_hid_control remove unnecessary parameter dealing tun: impossible to deassert IFF_ONE_QUEUE or IFF_NO_PI hamradio: fix dmascc section mismatch [SCTP]: Fix kernel panic while received AUTH chunk with BAD shared key identifier [SCTP]: Fix kernel panic while received AUTH chunk while enabled auth [IPV4]: Formatting fix for /proc/net/fib_trie. [IPV6]: Fix sysctl compilation error. [NET_SCHED]: Add #ifdef CONFIG_NET_EMATCH in net/sched/cls_flow.c (latest git broken build) [IPV4]: Fix compile error building without CONFIG_FS_PROC ...	2008-02-05 10:09:07 -08:00
Linus Torvalds	9914712e2e	Merge branch 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6 * 'agp-patches' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/agp-2.6: agp: remove flush_agp_mappings calls from new flush handling code intel-agp: introduce IS_I915 and do some cleanups.. [intel_agp] fix name for G35 chipset intel-agp: fixup resource handling in flush code. intel-agp: add new chipset ID agp: remove unnecessary pci_dev_put agp: remove uid comparison as security check fix AGP warning agp/intel: Add chipset flushing support for i8xx chipsets. intel-agp: add chipset flushing support agp: add chipset flushing support to AGP interface	2008-02-05 09:54:10 -08:00
Finn Thain	57dfee7c3f	mac68k: add nubus card definitions and a typo fix Add some new card definitions and fix a typo (from Eugen Paiuc). Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:24 -08:00
Rafael J. Wysocki	fa23f5cce8	leds: add possibility to remove leds classdevs during suspend/resume Make it possible to unregister a led classdev object in a safe way during a suspend/resume cycle. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:23 -08:00
Rafael J. Wysocki	a41e3dc406	HWRNG: add possibility to remove hwrng devices during suspend/resume Make it possible to unregister a Hardware Random Number Generator device object in a safe way during a suspend/resume cycle. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Michael Buesch <mb@bu3sch.de> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:23 -08:00
Rafael J. Wysocki	533354d4ac	Misc: Add possibility to remove misc devices during suspend/resume Make it possible to unregister a misc device object in a safe way during a suspend/resume cycle. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Michael Buesch <mb@bu3sch.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:23 -08:00
Mark Gross	f011e2e2df	latency.c: use QoS infrastructure Replace latency.c use with pm_qos_params use. Signed-off-by: mark gross <mgross@linux.intel.com> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Len Brown <lenb@kernel.org> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:22 -08:00
Mark Gross	d82b35186e	pm qos infrastructure and interface The following patch is a generalization of the latency.c implementation done by Arjan last year. It provides infrastructure for more than one parameter, and exposes a user mode interface for processes to register pm_qos expectations of processes. This interface provides a kernel and user mode interface for registering performance expectations by drivers, subsystems and user space applications on one of the parameters. Currently we have {cpu_dma_latency, network_latency, network_throughput} as the initial set of pm_qos parameters. The infrastructure exposes multiple misc device nodes one per implemented parameter. The set of parameters implement is defined by pm_qos_power_init() and pm_qos_params.h. This is done because having the available parameters being runtime configurable or changeable from a driver was seen as too easy to abuse. For each parameter a list of performance requirements is maintained along with an aggregated target value. The aggregated target value is updated with changes to the requirement list or elements of the list. Typically the aggregated target value is simply the max or min of the requirement values held in the parameter list elements. >From kernel mode the use of this interface is simple: pm_qos_add_requirement(param_id, name, target_value): Will insert a named element in the list for that identified PM_QOS parameter with the target value. Upon change to this list the new target is recomputed and any registered notifiers are called only if the target value is now different. pm_qos_update_requirement(param_id, name, new_target_value): Will search the list identified by the param_id for the named list element and then update its target value, calling the notification tree if the aggregated target is changed. with that name is already registered. pm_qos_remove_requirement(param_id, name): Will search the identified list for the named element and remove it, after removal it will update the aggregate target and call the notification tree if the target was changed as a result of removing the named requirement. >From user mode: Only processes can register a pm_qos requirement. To provide for automatic cleanup for process the interface requires the process to register its parameter requirements in the following way: To register the default pm_qos target for the specific parameter, the process must open one of /dev/[cpu_dma_latency, network_latency, network_throughput] As long as the device node is held open that process has a registered requirement on the parameter. The name of the requirement is "process_<PID>" derived from the current->pid from within the open system call. To change the requested target value the process needs to write a s32 value to the open device node. This translates to a pm_qos_update_requirement call. To remove the user mode request for a target value simply close the device node. [akpm@linux-foundation.org: fix warnings] [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix build again] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: mark gross <mgross@linux.intel.com> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Len Brown <lenb@kernel.org> Cc: Jaroslav Kysela <perex@suse.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com> Cc: Adam Belay <abelay@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:22 -08:00
Adrian Bunk	4ef7229ffa	make kernel_shutdown_prepare() static kernel_shutdown_prepare() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:22 -08:00
Casey Schaufler	e114e47377	Smack: Simplified Mandatory Access Control Kernel Smack is the Simplified Mandatory Access Control Kernel. Smack implements mandatory access control (MAC) using labels attached to tasks and data containers, including files, SVIPC, and other tasks. Smack is a kernel based scheme that requires an absolute minimum of application support and a very small amount of configuration data. Smack uses extended attributes and provides a set of general mount options, borrowing technics used elsewhere. Smack uses netlabel for CIPSO labeling. Smack provides a pseudo-filesystem smackfs that is used for manipulation of system Smack attributes. The patch, patches for ls and sshd, a README, a startup script, and x86 binaries for ls and sshd are also available on http://www.schaufler-ca.com Development has been done using Fedora Core 7 in a virtual machine environment and on an old Sony laptop. Smack provides mandatory access controls based on the label attached to a task and the label attached to the object it is attempting to access. Smack labels are deliberately short (1-23 characters) text strings. Single character labels using special characters are reserved for system use. The only operation applied to Smack labels is equality comparison. No wildcards or expressions, regular or otherwise, are used. Smack labels are composed of printable characters and may not include "/". A file always gets the Smack label of the task that created it. Smack defines and uses these labels: "" - pronounced "star" "_" - pronounced "floor" "^" - pronounced "hat" "?" - pronounced "huh" The access rules enforced by Smack are, in order: 1. Any access requested by a task labeled "" is denied. 2. A read or execute access requested by a task labeled "^" is permitted. 3. A read or execute access requested on an object labeled "_" is permitted. 4. Any access requested on an object labeled "*" is permitted. 5. Any access requested by a task on an object with the same label is permitted. 6. Any access requested that is explicitly defined in the loaded rule set is permitted. 7. Any other access is denied. Rules may be explicitly defined by writing subject,object,access triples to /smack/load. Smack rule sets can be easily defined that describe Bell&LaPadula sensitivity, Biba integrity, and a variety of interesting configurations. Smack rule sets can be modified on the fly to accommodate changes in the operating environment or even the time of day. Some practical use cases: Hierarchical levels. The less common of the two usual uses for MLS systems is to define hierarchical levels, often unclassified, confidential, secret, and so on. To set up smack to support this, these rules could be defined: C Unclass rx S C rx S Unclass rx TS S rx TS C rx TS Unclass rx A TS process can read S, C, and Unclass data, but cannot write it. An S process can read C and Unclass. Note that specifying that TS can read S and S can read C does not imply TS can read C, it has to be explicitly stated. Non-hierarchical categories. This is the more common of the usual uses for an MLS system. Since the default rule is that a subject cannot access an object with a different label no access rules are required to implement compartmentalization. A case that the Bell & LaPadula policy does not allow is demonstrated with this Smack access rule: A case that Bell&LaPadula does not allow that Smack does: ESPN ABC r ABC ESPN r On my portable video device I have two applications, one that shows ABC programming and the other ESPN programming. ESPN wants to show me sport stories that show up as news, and ABC will only provide minimal information about a sports story if ESPN is covering it. Each side can look at the other's info, neither can change the other. Neither can see what FOX is up to, which is just as well all things considered. Another case that I especially like: SatData Guard w Guard Publish w A program running with the Guard label opens a UDP socket and accepts messages sent by a program running with a SatData label. The Guard program inspects the message to ensure it is wholesome and if it is sends it to a program running with the Publish label. This program then puts the information passed in an appropriate place. Note that the Guard program cannot write to a Publish file system object because file system semanitic require read as well as write. The four cases (categories, levels, mutual read, guardbox) here are all quite real, and problems I've been asked to solve over the years. The first two are easy to do with traditonal MLS systems while the last two you can't without invoking privilege, at least for a while. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Cc: Joshua Brindle <method@manicmethod.com> Cc: Paul Moore <paul.moore@hp.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: "Ahmed S. Darwish" <darwish.07@gmail.com> Cc: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Serge E. Hallyn	3b7391de67	capabilities: introduce per-process capability bounding set The capability bounding set is a set beyond which capabilities cannot grow. Currently cap_bset is per-system. It can be manipulated through sysctl, but only init can add capabilities. Root can remove capabilities. By default it includes all caps except CAP_SETPCAP. This patch makes the bounding set per-process when file capabilities are enabled. It is inherited at fork from parent. Noone can add elements, CAP_SETPCAP is required to remove them. One example use of this is to start a safer container. For instance, until device namespaces or per-container device whitelists are introduced, it is best to take CAP_MKNOD away from a container. The bounding set will not affect pP and pE immediately. It will only affect pP' and pE' after subsequent exec()s. It also does not affect pI, and exec() does not constrain pI'. So to really start a shell with no way of regain CAP_MKNOD, you would do prctl(PR_CAPBSET_DROP, CAP_MKNOD); cap_t cap = cap_get_proc(); cap_value_t caparray[1]; caparray[0] = CAP_MKNOD; cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP); cap_set_proc(cap); cap_free(cap); The following test program will get and set the bounding set (but not pI). For instance ./bset get (lists capabilities in bset) ./bset drop cap_net_raw (starts shell with new bset) (use capset, setuid binary, or binary with file capabilities to try to increase caps) ********************************************************** cap_bound.c ********************************************************** #include <sys/prctl.h> #include <linux/capability.h> #include <sys/types.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #ifndef PR_CAPBSET_READ #define PR_CAPBSET_READ 23 #endif #ifndef PR_CAPBSET_DROP #define PR_CAPBSET_DROP 24 #endif int usage(char me) { printf("Usage: %s get\n", me); printf(" %s drop <capability>\n", me); return 1; } #define numcaps 32 char captable[numcaps] = { "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner", "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast", "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner", "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace", "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice", "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod", "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap" }; int getbcap(void) { int comma=0; unsigned long i; int ret; printf("i know of %d capabilities\n", numcaps); printf("capability bounding set:"); for (i=0; i<numcaps; i++) { ret = prctl(PR_CAPBSET_READ, i); if (ret < 0) perror("prctl"); else if (ret==1) printf("%s%s", (comma++) ? ", " : " ", captable[i]); } printf("\n"); return 0; } int capdrop(char str) { unsigned long i; int found=0; for (i=0; i<numcaps; i++) { if (strcmp(captable[i], str) == 0) { found=1; break; } } if (!found) return 1; if (prctl(PR_CAPBSET_DROP, i)) { perror("prctl"); return 1; } return 0; } int main(int argc, char argv[]) { if (argc<2) return usage(argv[0]); if (strcmp(argv[1], "get")==0) return getbcap(); if (strcmp(argv[1], "drop")!=0 \|\| argc<3) return usage(argv[0]); if (capdrop(argv[2])) { printf("unknown capability\n"); return 1; } return execl("/bin/bash", "/bin/bash", NULL); } ************************************************************ [serue@us.ibm.com: fix typo] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Casey Schaufler <casey@schaufler-ca.com>a Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com> Tested-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Andrew Morgan	46c383cc45	Remove unnecessary include from include/linux/capability.h KaiGai Kohei observed that this line in the linux header is not needed. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: KaiGai Kohei <kaigai@kaigai.gr.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Andrew Morgan	e338d263a7	Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Andrew Morton	8f6936f4d2	revert "capabilities: clean up file capability reading" Revert `b68680e473` to make way for the next patch: "Add 64-bit capability support to the kernel". We want to keep the vfs_cap_data.data[] structure, using two 'data's for 64-bit caps (and later three for 96-bit caps), whereas `b68680e473` had gotten rid of the 'data' struct made its members inline. The 64-bit caps patch keeps the stack abuse fix at get_file_caps(), which was the more important part of that patch. [akpm@linux-foundation.org: coding-style fixes] Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
David P. Quigley	4249259404	VFS/Security: Rework inode_getsecurity and callers to return resulting buffer This patch modifies the interface to inode_getsecurity to have the function return a buffer containing the security blob and its length via parameters instead of relying on the calling function to give it an appropriately sized buffer. Security blobs obtained with this function should be freed using the release_secctx LSM hook. This alleviates the problem of the caller having to guess a length and preallocate a buffer for this function allowing it to be used elsewhere for Labeled NFS. The patch also removed the unused err parameter. The conversion is similar to the one performed by Al Viro for the security_getprocattr hook. Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Fengguang Wu	8bc3be2751	writeback: speed up writeback of big dirty files After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate that more io should be done. With it the big dirty file no longer has to wait for the next kupdate invokation 5s later. In sync_sb_inodes() we only set more_io on super_blocks we actually visited. This avoids the interaction between two pdflush deamons. Also in __sync_single_inode() we don't blindly keep requeuing the io if the filesystem cannot progress. Failing to do so may lead to 100% iowait. Tested-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Nick Piggin	0ed361dec3	mm: fix PageUptodate data race After running SetPageUptodate, preceeding stores to the page contents to actually bring it uptodate may not be ordered with the store to set the page uptodate. Therefore, another CPU which checks PageUptodate is true, then reads the page contents can get stale data. Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after PageUptodate. Many places that test PageUptodate, do so with the page locked, and this would be enough to ensure memory ordering in those places if SetPageUptodate were only called while the page is locked. Unfortunately that is not always the case for some filesystems, but it could be an idea for the future. Also bring the handling of anonymous page uptodateness in line with that of file backed page management, by marking anon pages as uptodate when they _are_ uptodate, rather than when our implementation requires that they be marked as such. Doing allows us to get rid of the smp_wmb's in the page copying functions, which were especially added for anonymous pages for an analogous memory ordering problem. Both file and anonymous pages are handled with the same barriers. FAQ: Q. Why not do this in flush_dcache_page? A. Firstly, flush_dcache_page handles only one side (the smb side) of the ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away memory barriers in a completely unrelated function is nasty; at least in the PageUptodate macros, they are located together with (half) the operations involved in the ordering. Thirdly, the smp_wmb is only required when first bringing the page uptodate, wheras flush_dcache_page should be called each time it is written to through the kernel mapping. It is logically the wrong place to put it. Q. Why does this increase my text size / reduce my performance / etc. A. Because it is adding the necessary instructions to eliminate the data-race. Q. Can it be improved? A. Yes, eg. if you were to create a rule that all SetPageUptodate operations run under the page lock, we could avoid the smp_rmb places where PageUptodate is queried under the page lock. Requires audit of all filesystems and at least some would need reworking. That's great you're interested, I'm eagerly awaiting your patches. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Bron Gondwana	195cf453d2	mm/page-writeback: highmem_is_dirtyable option Add vm.highmem_is_dirtyable toggle A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of approximately 2Gb size which contains a hash format that is written randomly by the dbclean process. On 2.6.16 this process took a few minutes. With lowmem only accounting of dirty ratios, this takes about 12 hours of 100% disk IO, all random writes. Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to add the highmem back to the total available memory count. [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	3dfa5721f1	Page allocator: get rid of the list of cold pages We have repeatedly discussed if the cold pages still have a point. There is one way to join the two lists: Use a single list and put the cold pages at the end and the hot pages at the beginning. That way a single list can serve for both types of allocations. The discussion of the RFC for this and Mel's measurements indicate that there may not be too much of a point left to having separate lists for hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	9f8f217253	Page allocator: clean up pcp draining functions - Add comments explaing how drain_pages() works. - Eliminate useless functions - Rename drain_all_local_pages to drain_all_pages(). It does drain all pages not only those of the local processor. - Eliminate useless interrupt off / on sequences. drain_pages() disables interrupts on its own. The execution thread is pinned to processor by the caller. So there is no need to disable interrupts. - Put drain_all_pages() declaration in gfp.h and remove the declarations from suspend.h and from mm/memory_hotplug.c - Make software suspend call drain_all_pages(). The draining of processor local pages is may not the right approach if software suspend wants to support SMP. If they call drain_all_pages then we can make drain_pages() static. [akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Matt Mackall	f248dcb34d	maps4: move clear_refs code to task_mmu.c This puts all the clear_refs code where it belongs and probably lets things compile on MMU-less systems as well. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Matt Mackall	e6473092bd	maps4: introduce a generic page walker Introduce a general page table walker Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Matt Mackall	698dd4ba6b	maps4: move is_swap_pte Move is_swap_pte helper function to swapops.h for use by pagemap code Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Dave Hansen	8245525741	maps4: rework TASK_SIZE macros The following replaces the earlier patches sent. It should address David Rientjes's comments, and has been compile tested on all the architectures that it touches, save for parisc. For the /proc/<pid>/pagemap code[1], we need to able to query how much virtual address space a particular task has. The trick is that we do it through /proc and can't use TASK_SIZE since it references "current" on some arches. The process opening the /proc file might be a 32-bit process opening a 64-bit process's pagemap file. x86_64 already has a TASK_SIZE_OF() macro: #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64) I'd like to have that for other architectures. So, add it for all the architectures that actually use "current" in their TASK_SIZE. For the others, just add a quick #define in sched.h to use plain old TASK_SIZE. 1. http://www.linuxworld.com/news/2007/042407-kernel.html - MIPS portion from Ralf Baechle <ralf@linux-mips.org> [akpm@linux-foundation.org: fix mips build] Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Hugh Dickins	73b1262fa4	tmpfs: move swap swizzling into shmem move_to_swap_cache and move_from_swap_cache functions (which swizzle a page between tmpfs page cache and swap cache, to avoid page copying) are only used by shmem.c; and our subsequent fix for unionfs needs different treatments in the two instances of move_from_swap_cache. Move them from swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making add_to_swap_cache externally visible. shmem.c likes to say set_page_dirty where swap_state.c liked to say SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback makes moot (and implies we should lose that "shift page from clean_pages to dirty_pages list" comment: it's on neither). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	02098feaa4	swapin needs gfp_mask for loop on tmpfs Building in a filesystem on a loop device on a tmpfs file can hang when swapping, the loop thread caught in that infamous throttle_vm_writeout. In theory this is a long standing problem, which I've either never seen in practice, or long ago suppressed the recollection, after discounting my load and my tmpfs size as unrealistically high. But now, with the new aops, it has become easy to hang on one machine. Loop used to grab_cache_page before the old prepare_write to tmpfs, which seems to have been enough to free up some memory for any swapin needed; but the new write_begin lets tmpfs find or allocate the page (much nicer, since grab_cache_page missed tmpfs pages in swapcache). When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which has __GFP_IO\|__GFP_FS stripped off, and throttle_vm_writeout is designed to break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the mapping_gfp_mask - hence the hang. So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to swapin_readahead to read_swap_cache_async to add_to_swap_cache. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Hugh Dickins	46017e9548	swapin_readahead: move and rearrange args swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c beside its kindred read_swap_cache_async. Why were its args in a different order? rearrange them. And since it was always followed by a read_swap_cache_async of the target page, fold that in and return struct page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and read_swap_cache_async stubs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	aec2c3ed01	VM: allow get_page_unless_zero on compound pages Both slab defrag and the large blocksize patches need to ability to take refcounts on compound pages. May be useful in other places as well. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	9e2779fa28	is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	b3bdda02aa	vmalloc: add const to void* parameters Make vmalloc functions work the same way as kfree() and friends that take a const void * argument. [akpm@linux-foundation.org: fix consts, coding-style] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	48667e7a43	Move vmalloc_to_page() to mm/vmalloc. We already have page table manipulation for vmalloc in vmalloc.c. Move the vmalloc_to_page() function there as well. Move the definitions for vmalloc related functions in mm.h to a newly created section. A better place would be vmalloc.h but mm.h is basic and may depend on these functions. An alternative would be to include vmalloc.h in mm.h (like done for vmstat.h). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:13 -08:00

1 2 3 4 5 ...

9245 commits