The following results are related to Canada. Are you interested to view more results? Visit OpenAIRE - Explore.
1,703 Research products, page 1 of 171

  • Canada
  • 020202 computer hardware & architecture

  • Authors: 
    Wade Penson; Eric Huang; Dana Klamut; Eliana Wardle; Graeme Douglas; Scott Fazackerley; Ramon Lawrence;
    Publisher: IEEE

    Software development for embedded systems is challenging due to hardware resource limitations and complexities in testing and verification. Although there are numerous approaches and tools, integrating and deploying these tools for a particular software development project requires effort. The contribution of this work is a platform for continuous integration that adapts common open source testing software for enterprise development for use with Arduinos. A description of a software development workflow that utilizes the platform is provided, as well as its specific application to developing a database library for embedded systems.

  • Publication . Conference object . 2019
    Closed Access
    Authors: 
    Elizabeth Adams; Suganthi Venkatachalam; Seok-Bum Ko;
    Publisher: IEEE

    Inexact computing generally involves trading a reduction in accuracy for an improvement in circuit area and power consumption. The multiply-accumulate (MAC) operation is used extensively in convolutional neural networks, and such applications stand to benefit greatly from the introduction of approximation to the MAC operation. This paper introduces an unsigned approximate MAC unit architecture in which approximation is introduced to both the multiplication and accumulation stages. Four variations of the proposed design are implemented using the TSMC 65 nm technology and are used in an image smoothing application. The proposed architecture is compared to that of the exact MAC unit and is shown to reduce circuit area and power consumption by 67% and 49%, respectively. When compared to other approximate MAC architectures, the proposed design improves area-power product by up to 66%.
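
The accuracy-for-cost trade described in this abstract can be illustrated in software. The sketch below is not the architecture from the paper: it is a generic illustration that assumes two common textbook approximations, operand truncation for the multiplier and a lower-part-OR style adder for the accumulator. The function names and the parameter `k` (number of approximated low-order bits) are hypothetical.

```python
def approx_mul(a: int, b: int, k: int) -> int:
    """Multiply unsigned ints after zeroing the k lowest bits of each operand."""
    return ((a >> k) << k) * ((b >> k) << k)


def approx_add(x: int, y: int, k: int) -> int:
    """Add unsigned ints: the low k bits are OR-ed, the upper bits are added
    exactly, and no carry crosses from the low part into the upper part."""
    low_mask = (1 << k) - 1
    low = (x | y) & low_mask
    high = ((x >> k) + (y >> k)) << k
    return high | low


def approx_mac(acc: int, a: int, b: int, k: int) -> int:
    """One multiply-accumulate step with both stages approximated."""
    return approx_add(acc, approx_mul(a, b, k), k)


def dot(xs, ws, k: int) -> int:
    """Accumulate a dot product with the approximate MAC (k=0 is exact)."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = approx_mac(acc, x, w, k)
    return acc


if __name__ == "__main__":
    pixels = [200, 180, 90, 45]   # hypothetical unsigned inputs
    weights = [3, 5, 7, 2]        # hypothetical unsigned kernel
    exact = dot(pixels, weights, 0)
    for k in (0, 1, 2):
        approx = dot(pixels, weights, k)
        print(f"k={k}: result={approx}, absolute error={abs(exact - approx)}")
```

With `k=0` both stages reduce to exact arithmetic, so the error comes entirely from the bits that are approximated. In hardware, those bits are where the area and power savings would come from.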

  • Open Access
    Authors: 
    Karel Culik; Ivan Friš;
    Publisher: Elsevier BV
    Project: NSERC

    We introduce the notion of computational network (CN), which is a general model of an arbitrary (finite or infinite) system of parallel synchronized processors (systolic network). Our basic and very useful tools are topological transformations of the space-time diagrams (unrollings) of computations on CN. We show that the topological transformations on unrollings can be used to design systolic networks, to give simple proofs of their correctness, and to demonstrate the equivalence of different networks. For example, we used the transformation technique to give a concise proof of a strengthened version of Leiserson's and Saxe's Retiming Lemma and Systolic Conversion Theorem. As a practical application we show the correctness of a simple algorithm for distributed sorting on a systolic ring. Many other examples are given.

  • Publication . Conference object . 2019
    Closed Access
    Authors: 
    Jiaqiang Li; Pedro Reviriego; C. Argyrides; Liyi Xiao;
    Publisher: IEEE

    In the last decade, a number of Single Error Correction Double Adjacent Error Correction (SEC-DAEC) codes have been proposed to protect memories against Multiple Cell Upsets (MCUs). These codes are able to correct errors that affect two adjacent bits, which is one of the most common MCU patterns. However, soft errors can also affect the encoder and decoder circuitry, creating data corruption. An alternative to protect the encoders is to use parity prediction Concurrent Error Detection (CED) to detect errors and avoid writing erroneous words in the memory. This approach has been previously studied for Orthogonal Latin Square (OLS) codes and for matrix codes. In this paper, the implementation of parity prediction CED for SEC-DAEC codes is considered. To that end, it is first shown that CED has a significant cost for the existing SEC-DAEC codes. This is because they are odd weight codes and parity prediction is much simpler for even weight codes. Based on that observation, even weight SEC-DAEC codes are designed and evaluated. The results show that CED can be efficiently implemented in the proposed codes, which achieve a significant reduction in encoder circuit complexity compared to previously proposed SEC-DAEC codes.
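
One way to see why parity prediction is simpler for even weight codes is to look at the parity of the check bits an encoder produces. The sketch below is an illustration only, assuming a linear code in which each data bit toggles a fixed set of check bits (one matrix column per data bit). The two 4-bit column sets are hypothetical toy examples chosen for their column weights; they are not SEC-DAEC codes and are not taken from the paper.

```python
from itertools import product


def encode_checks(data_bits, columns):
    """Compute check bits: XOR together the columns of all data bits set to 1."""
    checks = [0] * len(columns[0])
    for d, col in zip(data_bits, columns):
        if d:
            checks = [c ^ h for c, h in zip(checks, col)]
    return checks


def parity(bits):
    """Return the XOR of all bits in the sequence."""
    p = 0
    for b in bits:
        p ^= b
    return p


def predicted_check_parity(data_bits, columns):
    """Predict the parity of the check bits directly from the data bits.
    A data bit flips the check parity only if its column has odd weight."""
    p = 0
    for d, col in zip(data_bits, columns):
        p ^= d & (sum(col) & 1)
    return p


# Hypothetical toy columns (one tuple per data bit), not real SEC-DAEC codes.
EVEN_COLUMNS = [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
ODD_COLUMNS = [(1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]

if __name__ == "__main__":
    for name, cols in (("even", EVEN_COLUMNS), ("odd", ODD_COLUMNS)):
        for data in product((0, 1), repeat=4):
            # The prediction must match the parity the encoder actually produces.
            assert predicted_check_parity(data, cols) == parity(encode_checks(data, cols))
        print(name, "weight columns: prediction matches encoder for all data words")
```

In this toy setting, the predicted check parity is constant (always 0) when every column has even weight, so the checker needs no logic over the data bits. When every column has odd weight, the prediction equals the parity of the whole data word, which requires an XOR tree across all data bits.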

  • Closed Access
    Authors: 
    Henry Wong; Vaughn Betz; Jonathan Rose;
    Publisher: Association for Computing Machinery (ACM)

    Although FPGAs have grown in capacity, FPGA-based soft processors have grown very little because of the difficulty of achieving higher performance in exchange for area. Superscalar out-of-order processors promise large performance gains, and the memory subsystem is a key part of such a processor that must help supply increased performance. In this article, we describe and explore microarchitectural and circuit-level tradeoffs in the design of such a memory system. We show the significant instructions-per-cycle wins for providing various levels of out-of-order memory access and memory dependence speculation (1.32× SPECint2000) and for the addition of a second-level cache (another 1.60×). With careful microarchitecture and circuit design, we also achieve an L1 translation lookaside buffer and cache lookup with 29% less logic delay than the simpler Nios II/f memory system.

  • Open Access
    Authors: 
    Monowar Hasan; Sibin Mohan; Rakesh B. Bobba; Rodolfo Pellizzoni;
    Publisher: IEEE

    Due to physical isolation as well as use of proprietary hardware and protocols, traditional real-time systems (RTS) were considered to be invulnerable to security breaches and external attacks. This assumption is being challenged by recent attacks that highlight vulnerabilities in RTS. Besides, a straightforward integration of security mechanisms might compromise the safety and predictability guarantees of such systems. In this paper, we focus on integrating security mechanisms into RTS (especially legacy RTS) and define a metric to measure the effectiveness of such integration. We combine opportunistic execution with hierarchical scheduling to maintain compatibility with legacy systems while still providing flexibility. The proposed approach is shown to increase the security posture of RTS without impacting their temporal (and hence, safety) constraints.

  • Publication . Conference object . 2018
    Open Access
    Authors: 
    Mahmoud Masadeh; Osman Hasan; Sofiène Tahar;
    Publisher: ACM

    Approximate multipliers are widely being advocated for energy-efficient computing in applications that exhibit an inherent tolerance to inaccuracy. In this paper, we identify three decisions for design and evaluation of approximate multiplier circuits: (1) the type of approximate full adder (FA) used to construct the multiplier, (2) the architecture, i.e., array or tree, of the multiplier and (3) the placement of sub-modules of approximate and exact multipliers in the target multiplier module. Based on FA cells implemented at the transistor level (TSMC65nm), we developed several approximate building blocks of 8x8 multipliers, as well as various implementations of higher order multipliers. These designs are evaluated based on their power, area, delay and error and the best designs are identified. We validate these designs on an image blending application using MATLAB, and compare them to related work.

  • Authors: 
    Haonan Wang; Fan Luo; Mohamed Ibrahim; Onur Kayiran; Adwait Jog;
    Publisher: IEEE

    Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving the overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. This is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in shared cache and memory. To address this problem, we propose new application-aware TLP management techniques for a multi-application execution environment such that all co-scheduled applications can make good and judicious use of all the shared resources. For measuring such use, we propose an application-level utility metric, called effective bandwidth, which accounts for two runtime metrics: attained DRAM bandwidth and cache miss rates. We find that maximizing the total effective bandwidth and doing so in a balanced fashion across all co-located applications can significantly improve the system throughput and fairness. Instead of exhaustively searching across all the different combinations of TLP configurations that achieve these goals, we find that a significant amount of overhead can be reduced by taking advantage of the trends, which we call patterns, in the way an application's effective bandwidth changes with different TLP combinations. Our proposed pattern-based TLP management mechanisms improve the system throughput and fairness by 20% and 2x, respectively, over a baseline where each application executes with a TLP configuration that provides the best performance when it executes alone.

  • Authors: 
    Shuli Gao; Dhamin Al-Khalili; J. M. Pierre Langlois; Noureddine Chabini;
    Publisher: IEEE

    This paper presents the design of pipelined IEEE 754-2008 decimal floating-point (DFP) multipliers targeting FPGAs. A key component of the architecture is the fixed-point multiplier function, which impacts the overall performance and area utilization. In this paper, we propose a new method to realize this operation by carefully organizing the partial products and developing an algorithm for binary-decimal compression. The DFP multipliers with 5 to 12 pipeline stages are coded in VHDL and implemented on a Xilinx Virtex-5 FPGA. The overall design is compared with another approach based on fixed-point multipliers using a BCD-4221 compression technique. Using post-layout extracted design data, our approach achieves a delay improvement in the range of 7.9% to 20.3% and an average LUT reduction of 5%.

  • Closed Access
    Authors: 
    Jeongwon Seo; Mingwei Gong; Ranesh Kumar Naha; Aniket Mahanti;
    Publisher: IEEE

    This paper aims to develop a real-time Plant Environment Simulator (PES), which simulates a corrugated plant effectively and realistically. The resulting solution can be used to provide factory workers or new developers with a responsive, simulated environment for learning how to use existing software correctly. The work is carried out for a large cardbox maker, and the simulator can be used to test new prototypes without using the actual plant facilities, so it will contribute economically and efficiently to the creation of new robust software products for the corrugated plant.
